Zhan Jianfeng: BigDataBench, benchmarking big data systems (hdhappy001)
This document discusses BigDataBench, an open source project for big data benchmarking. BigDataBench includes six real-world data sets and 19 workloads that cover common big data applications and preserve the four V's of big data. The workloads were chosen to represent typical application domains like search engines, social networks, and e-commerce. BigDataBench aims to provide a standardized benchmark for evaluating big data systems, architectures, and software stacks. It has been used in several case studies for workload characterization and performance evaluation of different hardware platforms for big data workloads.
The document discusses big data visualization and visual analysis, focusing on the challenges and opportunities. It begins with an overview of visualization and then discusses several challenges in big data visualization, including integrating heterogeneous data from different sources and scales, dealing with data and task complexity, limited interaction capabilities for large data, scalability for both data and users, and the need for domain and development libraries/tools. It then provides examples of visualizing taxi GPS data and traffic patterns in Beijing to identify traffic jams.
Spark is an open source cluster computing framework originally developed at UC Berkeley. Intel has made many contributions to Spark's development through code commits, patches, and collaborating with the Spark community. Spark is widely used by companies like Alibaba, Baidu, and Youku for large-scale data analytics and machine learning tasks. It allows for faster iterative jobs than Hadoop through its in-memory computing model and supports multiple workloads including streaming, SQL, and graph processing.
This document describes an interactive batch query system for game analytics based on Apache Drill. It addresses the problem of answering common ad-hoc queries over large volumes of log data by using a columnar data model and optimizing query plans. The system utilizes Drill's schema-free data model and vectorized query processing. It further improves performance by merging similar queries, reusing intermediate results, and pushing execution downwards to utilize multi-core CPUs. This provides a unified solution for both ad-hoc and scheduled batch analytics workloads at large scale.
Liu Chengzhong: Running Cloudera Impala on PostgreSQL (hdhappy001)
This document summarizes a presentation about running Cloudera Impala on PostgreSQL to enable SQL queries on large datasets. Key points:
- The company processes 3 billion daily ad impressions and 20TB of daily report data, requiring a scalable SQL solution.
- Impala was chosen for its fast performance from in-memory processing and code generation. The architecture runs Impala coordinators and executors across clusters.
- The author hacked Impala to also scan data from PostgreSQL for mixed workloads. This involved adding new scan node types and metadata.
- Tests on a 150 million row dataset showed Impala with PostgreSQL achieving 20 million rows scanned per second per core.
This document discusses big data in the cloud and provides an overview of YARN. It begins with introducing the speaker and their experience with VMware and Apache Hadoop. The rest of the document covers: 1) trends in big data like the rise of YARN, faster query engines, and focus on enterprise capabilities, 2) how YARN addresses limitations of MapReduce by splitting responsibilities, 3) how YARN serves as a hub for various big data applications, and 4) how YARN can integrate with cloud infrastructure for elastic resource management between the two frameworks. The document advocates for open source contribution to help advance big data technologies.
5. Rack server memory capacity
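• 2-CPU server, 768 GB of memory, memory price $6,000
• 4-CPU server, 1.5 TB of memory, memory price $12,000
• 8-CPU server, 3 TB of memory, memory price $24,000
Servers can now scale to large memory capacities, and memory prices are within an acceptable range: the memory era has arrived.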
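The three configurations above imply a roughly constant price per gigabyte. The sketch below checks that arithmetic. It assumes 1.5 TB and 3 TB mean 1536 GB and 3072 GB, and the names `CONFIGS` and `price_per_gb` are introduced here for illustration only.

```python
# Memory configurations quoted on the slide:
# (CPU sockets, memory in GB, memory price in USD).
# Assumption: 1.5 TB = 1536 GB and 3 TB = 3072 GB (binary terabytes).
CONFIGS = [(2, 768, 6000), (4, 1536, 12000), (8, 3072, 24000)]


def price_per_gb(memory_gb, price_usd):
    """Return the memory price per gigabyte in USD."""
    return price_usd / memory_gb


for sockets, memory_gb, price_usd in CONFIGS:
    rate = price_per_gb(memory_gb, price_usd)
    print(f"{sockets}-CPU server: {memory_gb} GB at ${rate:.2f} per GB")
```

Under these assumptions all three servers come out at the same rate (about $7.81 per GB), so the quoted memory cost scales linearly with capacity.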
[Figure: server memory configurations. 2 Intel Ivy Bridge processors with 768 GB of memory (two banks labeled 12 × 32 GB); 4 Ivy Bridge processors with 1.5 TB of memory; 8 Ivy Bridge processors with 3 TB of memory. CPUs are linked by QPI.]
• During 2014, DDR3 memory prices are expected to fall by 13%
• DDR4 memory prices are expected to fall by 10%
Source: http://wccftech.com/intel-broadwell-supports-ddr4-memory-server-platforms-arriving-consumers-2014/
Cloud Computing and Big Data Research Center, East China Normal University
7. Abundant computing power
Many-core processor architecture
[Figure: many-core processor diagram. Labels: Core 1 through Core n; within each core, Logical Processor 1 and Logical Processor 2, each with its architectural states (registers), plus an ALU and cache(s); a further cache(s) block outside the cores.]
Source: http://2.bp.blogspot.com/-liLwtV_GT_o/T5CSRWJqxoI/AAAAAAAAAPk/6dEJ6kvyzzc/s0/IntelsMulticoreReality.png
NUMA RAM
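• A single CPU with 12 cores, 24 hyper-threads, and a 2.7 GHz clock frequency is already commercially available
• Cumulative clock frequency of a single CPU: 64.8 GHz (24 hyper-threads × 2.7 GHz)
Processor technology can already scale to 100 cores, but the market still needs high single-core frequency to serve legacy single-threaded programs. Even so, processors already have ample capacity for processing in-memory data, and that capacity will keep growing.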
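The cumulative figure appears to be the number of hyper-threads multiplied by the clock frequency. A minimal sketch of that reading follows. The function name `cumulative_clock_ghz` is hypothetical, and the figure is a rough aggregate rather than a measure of real throughput, since hyper-threads share a core's execution units.

```python
import math


def cumulative_clock_ghz(hardware_threads, clock_ghz):
    """Aggregate clock frequency, assuming every hardware thread counts fully."""
    return hardware_threads * clock_ghz


# Values from the slide: 12 cores, 24 hyper-threads, 2.7 GHz.
total = cumulative_clock_ghz(24, 2.7)
print(f"Cumulative clock frequency: {total:.1f} GHz")

# Compare with a tolerance: the floating-point product is not exactly 64.8.
assert math.isclose(total, 64.8)
```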
[Chart: number of cores inside a CPU]
Source: In-memory data management: an inflection point for enterprise applications.
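14. Communication bottleneck experiment
Data access performance comparison
Record size affects data retrieval performance
Hardware: 2 CPUs, 16 GB memory, 1 Gbps Ethernet
Data table: a 4 GB table file containing variable-length records, stored on local disk and in remote memory
Disk I/O bottleneck: vulnerable to random disk access; random access to data on disk disturbs performance heavily.
Memory wall: vulnerable to how data is placed in memory (partly because of record length).
Communication wall: network bandwidth is limited compared with the very large volume of data moved within the cluster.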
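The communication wall can be made concrete with a back-of-the-envelope lower bound for the setup above. The sketch assumes decimal units (4 GB = 4 × 10^9 bytes, 1 Gbps = 10^9 bits per second) and ignores protocol overhead and latency, so real transfers would take longer. The function name `min_transfer_seconds` is introduced here for illustration.

```python
def min_transfer_seconds(size_bytes, link_bits_per_second):
    """Lower bound on the time to move size_bytes over a link at full line rate."""
    return size_bytes * 8 / link_bits_per_second


# Assumptions: decimal units, no protocol overhead, no latency.
TABLE_BYTES = 4 * 10**9   # the 4 GB table file from the experiment
LINK_BPS = 10**9          # 1 Gbps Ethernet

seconds = min_transfer_seconds(TABLE_BYTES, LINK_BPS)
print(f"Fetching the table from remote memory takes at least {seconds:.0f} s")
```

Under these assumptions, fetching the whole table from remote memory takes at least 32 seconds however fast the remote memory is, which is the sense in which network bandwidth rather than memory becomes the limit.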