Zhan Jianfeng (詹剑锋): BigDataBench: benchmarking big data systems (hdhappy001)
This document discusses BigDataBench, an open source project for big data benchmarking. BigDataBench includes six real-world data sets and 19 workloads that cover common big data applications and preserve the four V's of big data. The workloads were chosen to represent typical application domains like search engines, social networks, and e-commerce. BigDataBench aims to provide a standardized benchmark for evaluating big data systems, architectures, and software stacks. It has been used in several case studies for workload characterization and performance evaluation of different hardware platforms for big data workloads.
The document discusses big data visualization and visual analysis, focusing on the challenges and opportunities. It begins with an overview of visualization and then discusses several challenges in big data visualization, including integrating heterogeneous data from different sources and scales, dealing with data and task complexity, limited interaction capabilities for large data, scalability for both data and users, and the need for domain and development libraries/tools. It then provides examples of visualizing taxi GPS data and traffic patterns in Beijing to identify traffic jams.
Spark is an open source cluster computing framework originally developed at UC Berkeley. Intel has made many contributions to Spark's development through code commits, patches, and collaborating with the Spark community. Spark is widely used by companies like Alibaba, Baidu, and Youku for large-scale data analytics and machine learning tasks. It allows for faster iterative jobs than Hadoop through its in-memory computing model and supports multiple workloads including streaming, SQL, and graph processing.
This document describes an interactive batch query system for game analytics based on Apache Drill. It addresses the problem of answering common ad-hoc queries over large volumes of log data by using a columnar data model and optimizing query plans. The system utilizes Drill's schema-free data model and vectorized query processing. It further improves performance by merging similar queries, reusing intermediate results, and pushing execution downwards to utilize multi-core CPUs. This provides a unified solution for both ad-hoc and scheduled batch analytics workloads at large scale.
Liu Chengzhong (刘诚忠): Running Cloudera Impala on PostgreSQL (hdhappy001)
This document summarizes a presentation about running Cloudera Impala on PostgreSQL to enable SQL queries on large datasets. Key points:
- The company processes 3 billion daily ad impressions and 20TB of daily report data, requiring a scalable SQL solution.
- Impala was chosen for its fast performance from in-memory processing and code generation. The architecture runs Impala coordinators and executors across clusters.
- The author hacked Impala to also scan data from PostgreSQL for mixed workloads. This involved adding new scan node types and metadata.
- Tests on a 150 million row dataset showed Impala with PostgreSQL achieving 20 million rows scanned per second per core.
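As a rough sanity check on the figures in the key points above, the arithmetic can be sketched as follows. This is a back-of-the-envelope illustration only: it assumes impressions arrive evenly through the day and that the quoted per-core scan rate holds for the whole test table, neither of which the summary states. The function names are hypothetical.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the summary.
# Assumptions (not from the presentation): load is spread evenly over the
# day, and a single core sustains the quoted scan rate for the whole table.

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(events_per_day):
    """Average events per second, assuming a uniform daily load."""
    return events_per_day / SECONDS_PER_DAY


def single_core_scan_seconds(rows, rows_per_second_per_core):
    """Time for one core to scan `rows` at a constant per-core rate."""
    return rows / rows_per_second_per_core


impressions_per_second = average_rate_per_second(3_000_000_000)
scan_seconds = single_core_scan_seconds(150_000_000, 20_000_000)

print(f"{impressions_per_second:,.0f} impressions/s on average")
print(f"{scan_seconds:.1f} s to scan the test table on one core")
```

Under these assumptions, the daily volume averages out to roughly 35,000 impressions per second, and one core would scan the 150 million row test table in about 7.5 seconds.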
This document discusses big data in the cloud and provides an overview of YARN. It begins by introducing the speaker and their experience with VMware and Apache Hadoop. The rest of the document covers: 1) trends in big data such as the rise of YARN, faster query engines, and a focus on enterprise capabilities, 2) how YARN addresses limitations of MapReduce by splitting responsibilities, 3) how YARN serves as a hub for various big data applications, and 4) how YARN can integrate with cloud infrastructure for elastic resource management between the two. The document advocates open source contribution to help advance big data technologies.
Raghu Nambiar: Industry standard benchmarks (hdhappy001)
Industry standard benchmarks have played a crucial role in advancing the computing industry by enabling healthy competition that drives product improvements and new technologies. Major benchmarking organizations like TPC, SPEC, and SPC have developed numerous benchmarks over time to keep up with industry needs. Looking ahead, new benchmarks are needed to address emerging technologies like cloud, big data, and the internet of things. International conferences and workshops bring together experts to collaborate on developing these new, relevant benchmarks.