SlideShare a Scribd company logo
Tilmann Rabl 
Middleware Systems Research Group & bankmark UG 
ISC’14, June 26, 2014 
Crafting Benchmarks for Big Data 
MIDDLEWARESYSTEMS 
RESEARCH GROUP 
MSRG.ORG
Outline 
•Big Data Benchmarking Community 
•Our approach to building benchmarks 
•Big Data Benchmarks 
•Characteristics 
•BigBench 
•Big Decisions 
•Hammer 
•DAP 
•Slides borrowed from Chaitan Baru 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 2
Big Data Benchmarking Community 
•Genesis of the Big Data Benchmarking effort 
•Grant from NSF under the Cluster Exploratory (CluE) program (Chaitan Baru, SDSC) 
•Chaitan Baru (SDSC), Tilmann Rabl (University of Toronto), Milind Bhandarkar (Pivotal/Greenplum), Raghu Nambiar (Cisco), Meikel Poess (Oracle) 
•Launched Workshops on Big Data Benchmarking 
•First WBDB: May 2012, San Jose. Hosted by Brocade 
•Objectives 
•Lay the ground for development of industry standards for measuring the effectiveness of hardware and software technologies dealing with big data 
•Exploit synergies between benchmarking efforts 
•Offer a forum for presenting and debating platforms, workloads, data sets and metrics relevant to big bata 
•Big Data Benchmark Community (BDBC) 
•Regular conference calls for talks and announcements 
•Open to anyone interested, free of charge 
•BDBC makes no claims to any developments or ideas 
•clds.ucsd.edu/bdbc/community 
Crafting Benchmarks 26.06.2014 for Big Data - Tilmann Rabl 3
•Actian 
•AMD 
•BMMsoft 
•Brocade 
•CA Labs 
•Cisco 
•Cloudera 
•Convey Computer 
•CWI/Monet 
•Dell 
•EPFL 
•Facebook 
•Google 
•Greenplum 
•Hewlett-Packard 
•Hortonworks 
•Indiana Univ/ HathitrustResearch Foundation 
•InfoSizing 
•Intel 
•LinkedIn 
•MapR/Mahout 
•Mellanox 
•Microsoft 
•NSF 
•NetApp 
•NetApp/OpenSFS 
•Oracle 
•Red Hat 
•San Diego Supercomputer Center 
•SAS 
•Scripps Research Institute 
•Seagate 
•Shell 
•SNIA 
•Teradata Corporation 
•Twitter 
•UC Irvine 
•Univ. of Minnesota 
•Univ. of Toronto 
•Univ. of Washington 
•VMware 
•WhamCloud 
•Yahoo! 
1st WBDB: Attendee Organizations 
Crafting Benchmarks f 26.06.2014 or Big Data - Tilmann Rabl 4
Further Workshops 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 5 
2ndWBDB: http://clds.sdsc.edu/wbdb2012.in 
3rdWBDB: http://clds.sdsc.edu/wbdb2013.cn 
4thWBDB: http://clds.sdsc.edu/wbdb2013.us 
5thWBDB: http://clds.sdsc.edu/wbdb2014.de
First Outcomes 
•Big Data Benchmarking Community (BDBC) mailing list (~200 members from ~80 organizations) 
•Organized webinars every other Thursday 
•http://clds.sdsc.edu/bdbc/community 
•Paper from First WBDB 
•Setting the Direction for Big Data Benchmark Standards C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, and T. Rabl, published in Selected Topics in Performance Evaluation and Benchmarking, Springer-Verlag 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 6
Further Outcomes 
•Selected papers in Springer Verlag, Lecture Notes in Computer Science, Springer Verlag 
•LNCS 8163: Specifying Big Data Benchmarks (covering the first and second workshops) 
•LNCS 8585: Advancing Big Data Benchmarks (covering the third and fourth workshops, in print) 
•Papers from 5thWBDB will be in VolIII 
•Formation of TPC Subcommittee on Big Data Benchmarking 
•Working on TPCx-HS: TPC Express benchmark for Hadoop Systems, based on Terasort 
•http://www.tpc.org/tpcbd/ 
•Formation of a SPEC Research Group on Big Data Benchmarking 
•Proposal of BigDataTop100 List 
•Specification of BigBench 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 7
TPC Big Data Subcommittee 
•TPCx-HS 
•TPC Express for Hadoop Systems 
•Based on Terasort 
•Teragen, Terasort, Teravalidate 
•Database size / Scale Factors 
•SF: 1, 3, 10, 30, 100, 300, 1000, 3000, 10000 TB 
•Corresponds to: 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B 100-byte records 
•Performance Metric 
•HSph@SF= SF/T (total elapsed time in hours) 
•Price/Performance 
•$/HSph, $ is 3-year total cost of ownership 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 8
Formation of SPEC Research Big Data Working Group 
•Mission Statement 
ThemissionoftheBigData(BD)workinggroupistofacilitateresearchandtoengageindustryleadersfordefininganddevelopingperformancemethodologiesofbigdataapplications.Theterm‘‘bigdata’’hasbecomeamajorforceofinnovationacrossenterprisesofallsizes.Newplatforms,claimingtobethe“bigdata”platformwithincreasinglymorefeaturesformanagingbigdatasets,arebeingannouncedalmostonaweeklybasis.Yet,thereiscurrentlyalackofwhatconstitutesabigdatasystemandanymeansofcomparabilityamongsuchsystems. 
•Initial Committee Structure 
•Tilmann Rabl (Chair) 
•Chaitan Baru (Vice Chair) 
•Meikel Poess (Secretary) 
•To replace less formal BDBC group 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 9
BigDataTop100 List 
•Modeled after Top500 and Graph500 in HPC community 
•Proposal presented at Strata Conference, February 2013 
•Based on application-level benchmarking 
•Article in inaugural issue of the Big Data Journal 
•Big Data Benchmarking and the Big Data Top100 List by Baru, Bhandarkar, Nambiar, Poess, Rabl, Big Data Journal, Vol.1, No.1, 60-64, Anne LiebertPublications. 
•In progress 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 10
Big Data Benchmarks 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 11
Types of Big Data Benchmarks 
•Micro-benchmarks. To evaluate specific lower-level, system operations 
•E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al, OSU 
•Functional benchmarks. Specific high-level function. 
•E.g. Sorting: Terasort 
•E.g. Basic SQL: Individual SQL operations, e.g. Select, Project, Join, Order-By, … 
•Genre-specific benchmarks. Benchmarks related to type of data 
•E.g. Graph500. Breadth-first graph traversals 
•Application-level benchmarks 
•Measure system performance (hardware and software) for a given application scenario—with given data and workload 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 12
Application-Level Benchmark Design Issues from WBDB 
•Audience: Who is the audience for the benchmark? 
•Marketing (Customers / End users) 
•Internal Use (Engineering) 
•Academic Use (Research and Development) 
•Is the benchmark for innovation or competition? 
•If a competitive benchmark is successful, it will be used for innovation 
•Application: What type of application should be modeled? 
•TPC: schema + transaction/query workload 
•BigData: Abstractions of a data processing pipeline, e.g. Internet-scale businesses 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 13
App Level Issues -2 
•Component vs. end-to-end benchmark. Is it possible to factor out a set of benchmark “components”, which can be isolated and plugged into an end-to- end benchmark? 
•The benchmark should consist of individual components that ultimately make up an end-to- end benchmark 
•Single benchmark specification: Is it possible to specify a single benchmark that captures characteristics of multiple applications ? 
•Maybe: Create a single, multi-step benchmark, with plausible end-to-end scenario 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 14
App Level Issues -3 
•Paper & Pencil vs. Implementation-based. Should the implementation be specification-driven or implementation-driven? 
•Start with an implementation and develop specification at the same time 
•Reuse. Can we reuse existing benchmarks? 
•Leverage existing work and built-up knowledgebase 
•Benchmark Data. Where do we get the data from? 
•Synthetic data generation: structured, non-structured data 
•Verifiability. Should there be a process for verification of results? 
•YES! 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 15
Abstracting the Big Data World 
1.Enterprise Data Warehouse + Other types of data 
•Structured enterprise data warehouse 
•Extend to incorporate semi-structured data, e.g. from weblogs, machine logs, clickstream, customer reviews, … 
•“Design time” schemas 
2.Collection of heterogeneous data + Pipelines of processing 
•Enterprise data processing as a pipeline from data ingestion to transformation, extraction, subsetting, machine learning, predictive analytics 
•Data from multiple structured and non-structured sources 
•“Runtime” schemas –late binding, application-driven schemas 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 16
Other Benchmarks discussed at WBDB 
•Big Decision, Jimmy Zhao, HP 
•HiBench/Hammer, LanYi, Intel 
•BigDataBench, JianfengZhan, Chinese Academy of Sciences 
•CloudSuite, OnurKocberber, EPFL 
•Genre specific benchmarks 
•Microbenchmarks 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 17
The BigBenchProposal 
•End to end benchmark 
•Application level 
•Based on a product retailer (TPC-DS) 
•Focused on Parallel DBMS and MR engines 
•History 
•Launched at 1stWBDB, San Jose 
•Published at SIGMOD 2013 
•Full spec at WBDB proceedings 2012 
•Full kit at WBDB 2014 
•Collaboration with Industry & Academia 
•First: Teradata, University of Toronto, Oracle, InfoSizing 
•Now: UofT, bankmark, Intel, Oracle, Microsoft, UCSD, Pivotal, Cloudera, InfoSizing, SAP, Hortonworks, Cisco, … 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 18
Data Model 
19 
Unstructured Data 
Semi-Structured Data 
Structured Data 
Sales 
Customer 
Item 
Marketprice 
Web Page 
Web Log 
Reviews 
Adapted 
TPC-DS 
BigBench 
Specific 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl
•Variety 
•Different schema parts 
•Volume 
•Based on scale factor 
•Similar to TPC-DS scaling, but continuous 
•Weblogs & product reviews also scaled 
•Velocity 
•Refreshes for all data 
•Different velocity for different areas 
•Vstructured< Vunstructured< Vsemistructured 
Data Model –3 Vs 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 20
Workload 
•Workload Queries 
•30 “queries” 
•Specified in English (sort of) 
•No required syntax 
•Business functions (Adapted from McKinsey) 
•Marketing 
•Cross-selling, Customer micro-segmentation, Sentiment analysis, Enhancing multichannel consumer experiences 
•Merchandising 
•Assortment optimization, Pricing optimization 
•Operations 
•Performance transparency, Product return analysis 
•Supply chain 
•Inventory management 
•Reporting (customers and products) 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 21
SQL-MR Query 1 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 22
HiveQLQuery 1 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 23 
SELECTpid1, pid2, COUNT (*) AS cnt 
FROM ( 
FROM ( 
FROM ( 
SELECT s.ss_ticket_numberAS oid, s.ss_item_skAS pid 
FROM store_saless 
INNER JOIN item iON s.ss_item_sk= i.i_item_sk 
WHERE i.i_category_idin (1 ,2 ,3) and s.ss_store_skin (10 , 20, 33, 40, 50) 
) q01_temp_join 
MAP q01_temp_join.oid, q01_temp_join.pid 
USING 'cat' 
AS oid, pid 
CLUSTER BY oid 
) q01_map_output 
REDUCE q01_map_output.oid, q01_map_output.pid 
USING 'java -cpbigbenchqueriesmr.jar:hive-contrib.jarde.bankmark.bigbench.queries.q01.Red' 
AS (pid1 BIGINT, pid2 BIGINT) 
) q01_temp_basket 
GROUP BY pid1, pid2 
HAVING COUNT (pid1) > 49 
ORDER BY pid1, cnt, pid2;
BigBench Current Status 
•All queries are available in Hive/Hadoop 
•New data generator (continuous scaling, realistic data) available 
•New metric available 
•Complete driver available 
•Refresh will be done soon 
•Full kit at WBDB 2014 
•https://github.com/intel-hadoop/Big-Bench 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 24 
Query Types 
Number of Queries 
Percentage 
Pure HiveQL 
14 
46% 
Mahout 
5 
17% 
OpenNLP 
5 
17% 
Custom MR 
6 
20%
Big Decision, Jimmy Zhao, HP, 4thWBDB 
•Benchmark for A DSS/Data Mining solutions 
•Everything running in the same system 
•Engine of Analytics 
•Reflecting the real business model 
•Huge data volume 
•Data from Social 
•Data from Web log 
•Data from Comments 
•Broader Data support 
•Semi-structured data 
•Un-structured data 
•Continuous Data Integration 
•ETL just a normal job of the system 
•Data Integration whenever there’s data 
•Big Data Analytics 
Big Decision –Big TPC-DS! 
TPC-DS 
•Mature and proved workload for BI 
•Mix workloads 
•Well defined scale factors 
Semi + unstructured TPC-DS 
•Additional data and dimension from new data 
•Semi-structured and unstructured data 
•TB to PB or even Zeta Byte support 
NEW TPC-DS generator –Agile ETL 
•Continuously data generation and injection 
•Consider as part of the workloads 
New massive parallel processing technologies 
•Convert queries to SQL liked queries 
•Include interactive & regular Queries 
•Include Machine Learning jobs 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 25
Agile ETL 
Marketing 
Big Decision Block Diagram 
SNS Marketing 
TPC-DS 
Web page 
Sales 
Web log 
Item 
Reviews 
Social Message 
Search & Social Advertise 
Search 
Social Advertise 
Social Web pages 
Extraction 
Transform 
Load 
Customer 
Social Feedbacks 
Mobile log 
Mobile log 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 26
HiBench, LanYi, Intel, 4thWBDB 
26.06.2014 
HiBench 
–Enhanced DFSIO 
Micro Benchmarks 
Web Search 
–Sort 
–WordCount 
–TeraSort 
–Nutch Indexing 
–Page Rank 
Machine Learning 
–Bayesian Classification 
–K-Means Clustering 
HDFSSee our paper “The HiBenchSuite: Characterization of the MapReduce-Based Data Analysis” in ICDE’10 workshops (WISS’10) 
1.Different from GridMix, SWIM? 
2.Micro Benchmark? 
3.Isolated components? 
4.End-2-end Benchmark? 
5.We need ETL-Recommendation Pipeline 
Crafting Benchmarks for Big Data - Tilmann Rabl 27
ETL-Recommendation (hammer) 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 28
ETL-Recommendation (hammer) 
•Task Dependences 
Pref-logs 
ETL-logs 
Pref-sales 
Item based Collaborative Filtering 
Pref-comb 
ETL-sales 
Offline test 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 29
The Deep Analytics Pipeline, Bhandarkar (1stWBDB) 
•“User Modeling” pipelines 
•Generic use case: Determine user interests or user categories by mining user activities 
•Large dimensionality of possible user activities 
•Typical user represents a sparse activity vector 
•Event attributes change over time 
30 
Data Acquisition/ 
Normalization/ Sessionization 
Feature andTarget Generation 
Model Training 
Offline Scoring & Evaluation 
Batch Scoring & Upload toServer 
Acquisition/ 
Recording 
Extraction/ 
Cleaning/ 
Annotation 
Integration/ Aggregation/ 
Representation 
Analysis/ 
Modeling 
Interpretation 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl
Example Application Domains 
•Retail 
•Events: clicks on purchases, ad clicks, FB likes, … 
•Goal: Personalized product recommendations 
•Datacenters 
•Events: log messages, traffic, communications events, … 
•Goal: Predict imminent failures 
•Healthcare 
•Events: Doctor visits, medical history, medicine refills, … 
•Goal: Prevent hospital readmissions 
•Telecom 
•Events: Calls made, duration, calls dropped, location, social graph, … 
•Goal: Reduce customer churn 
•Web Ads 
•Events: Clicks on content, likes, reposts, search queries, comments, … 
•Goal: Increase engagement, increase clicks on revenue-generation content 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 31
Steps in the Pipeline 
•Acquisition and normalization of data 
•Collate, consolidate data 
•Join targets and features 
•Construct targets; filter out user activity without targets; join feature vector with targets 
•Model Training 
•Multi-model: regressions, Naïve Bayes, decision trees, Support Vector Machines, … 
•Offline scoring 
•Score features, evaluate metrics 
•Batch scoring 
•Apply models to all user activity; upload scores 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 32
Application Classes 
•Widely varying number of events per entity 
•Multiple classes of applications, based on size, e.g.: 
•Tiny (100K entities, 10 events per entity) 
•Small (1M entities, 10 events per entity) 
•Medium (10M entities, 100 events per entity) 
•Large (100M entities, 1000 events per entity) 
•Huge (1B entities, 1000 events per entity) 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 33
Proposal for Pipeline Benchmark Results 
•Publish results for every stage in the pipeline 
•Data pipelines for different application domains may be constructed by mix and match of various pipeline stages 
•Different modeling techniques per class 
•So, need to publish performance numbers for every stage 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 34
Get involved 
•Workshop on Big Data Benchmarking (WBDB) 
•Fifth workshop: August 6-7, Potsdam, Germany 
•clds.ucsd.edu/wbdb2014.de 
•Proceedings will be published in Springer LNCS 
•Big Data Benchmarking Community 
•Biweekly conference calls (sort of) 
•Mailing list 
•clds.ucsd.edu/bdbc/community 
•Coming up next: BDBC@SPEC Research 
•We will join forces with SPEC Research 
•Try BigBench: 
•https://github.com/intel-hadoop/Big-Bench 
26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 35
Questions? 
Crafting Benchmarks for Big Data - Tilmann Rabl 36 
Contact: 
Tilmann Rabl 
tilmann.rabl@bankmark.de 
tilmann.rabl@utoronto.ca 
26.06.2014 
MIDDLEWARESYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
Thank You!

More Related Content

What's hot

Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
inside-BigData.com
 
ISNCC 2017
ISNCC 2017ISNCC 2017
ISNCC 2017
Rim Moussa
 
SQL Server Managing Test Data & Stress Testing January 2011
SQL Server Managing Test Data & Stress Testing January 2011SQL Server Managing Test Data & Stress Testing January 2011
SQL Server Managing Test Data & Stress Testing January 2011
Mark Ginnebaugh
 
Making BD Work~TIAS_20150622
Making BD Work~TIAS_20150622Making BD Work~TIAS_20150622
Making BD Work~TIAS_20150622Anthony Potappel
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems
hdhappy001
 
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
On Performance Under Hotspots in Hadoop versus Bigdata Replay PlatformsOn Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
Tokyo University of Science
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
Geoffrey Fox
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
IT Strategy Group
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
Maulik Thaker
 
Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data Security
DataWorks Summit
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
Geoffrey Fox
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEWShiyong Lu
 
How to use flash drives with Apache Hadoop 3.x: Real world use cases and proo...
How to use flash drives with Apache Hadoop 3.x: Real world use cases and proo...How to use flash drives with Apache Hadoop 3.x: Real world use cases and proo...
How to use flash drives with Apache Hadoop 3.x: Real world use cases and proo...
DataWorks Summit
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthy
DataWorks Summit
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
Gwen (Chen) Shapira
 
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Alluxio, Inc.
 
Modern data warehouse presentation
Modern data warehouse presentationModern data warehouse presentation
Modern data warehouse presentation
David Rice
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
nallagangus
 
My other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 editionMy other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 edition
Steve Loughran
 

What's hot (20)

Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
ISNCC 2017
ISNCC 2017ISNCC 2017
ISNCC 2017
 
SQL Server Managing Test Data & Stress Testing January 2011
SQL Server Managing Test Data & Stress Testing January 2011SQL Server Managing Test Data & Stress Testing January 2011
SQL Server Managing Test Data & Stress Testing January 2011
 
Making BD Work~TIAS_20150622
Making BD Work~TIAS_20150622Making BD Work~TIAS_20150622
Making BD Work~TIAS_20150622
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems
 
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
On Performance Under Hotspots in Hadoop versus Bigdata Replay PlatformsOn Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Ss eb29
Ss eb29Ss eb29
Ss eb29
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
 
Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data Security
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEW
 
How to use flash drives with Apache Hadoop 3.x: Real world use cases and proo...
How to use flash drives with Apache Hadoop 3.x: Real world use cases and proo...How to use flash drives with Apache Hadoop 3.x: Real world use cases and proo...
How to use flash drives with Apache Hadoop 3.x: Real world use cases and proo...
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthy
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
 
Modern data warehouse presentation
Modern data warehouse presentationModern data warehouse presentation
Modern data warehouse presentation
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
My other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 editionMy other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 edition
 

Similar to Crafting bigdatabenchmarks

Current Trends and Challenges in Big Data Benchmarking
Current Trends and Challenges in Big Data BenchmarkingCurrent Trends and Challenges in Big Data Benchmarking
Current Trends and Challenges in Big Data Benchmarking
eXascale Infolab
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking Tutorial
Tilmann Rabl
 
Big data
Big dataBig data
Big data
nikki135
 
Data Modeling and Scale Out - ScaleBase + 451-Group webinar 30.4.2015
Data Modeling and Scale Out - ScaleBase + 451-Group webinar 30.4.2015 Data Modeling and Scale Out - ScaleBase + 451-Group webinar 30.4.2015
Data Modeling and Scale Out - ScaleBase + 451-Group webinar 30.4.2015
Vladi Vexler
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
Mithlesh Sadh
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
Kristian Alexander
 
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
israel edem
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
SoftServe
 
Accelerating Big Data & Analytics Innovations through Public – Private Partne...
Accelerating Big Data & Analytics Innovations through Public – Private Partne...Accelerating Big Data & Analytics Innovations through Public – Private Partne...
Accelerating Big Data & Analytics Innovations through Public – Private Partne...
Prof. Dr. Alexander Maedche
 
Easy SPARQLing for the Building Performance Professional
Easy SPARQLing for the Building Performance ProfessionalEasy SPARQLing for the Building Performance Professional
Easy SPARQLing for the Building Performance Professional
Martin Kaltenböck
 
The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of Hadoop
Inside Analysis
 
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data ServicesMarlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
Infochimps, a CSC Big Data Business
 
Government GraphSummit: And Then There Were 15 Standards
Government GraphSummit: And Then There Were 15 StandardsGovernment GraphSummit: And Then There Were 15 Standards
Government GraphSummit: And Then There Were 15 Standards
Neo4j
 
Bigdata-Intro.pptx
Bigdata-Intro.pptxBigdata-Intro.pptx
Bigdata-Intro.pptx
smitasatpathy2
 
Pr dc 2015 sql server is cheaper than open source
Pr dc 2015 sql server is cheaper than open sourcePr dc 2015 sql server is cheaper than open source
Pr dc 2015 sql server is cheaper than open source
Terry Bunio
 
May_Planning_20140528-Final3
May_Planning_20140528-Final3May_Planning_20140528-Final3
May_Planning_20140528-Final3Zack Chang
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Robert Grossman
 
How to make your data count webinar, 26 Nov 2018
How to make your data count webinar, 26 Nov 2018How to make your data count webinar, 26 Nov 2018
How to make your data count webinar, 26 Nov 2018
ARDC
 

Similar to Crafting bigdatabenchmarks (20)

Current Trends and Challenges in Big Data Benchmarking
Current Trends and Challenges in Big Data BenchmarkingCurrent Trends and Challenges in Big Data Benchmarking
Current Trends and Challenges in Big Data Benchmarking
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking Tutorial
 
Big data
Big dataBig data
Big data
 
Data Modeling and Scale Out - ScaleBase + 451-Group webinar 30.4.2015
Data Modeling and Scale Out - ScaleBase + 451-Group webinar 30.4.2015 Data Modeling and Scale Out - ScaleBase + 451-Group webinar 30.4.2015
Data Modeling and Scale Out - ScaleBase + 451-Group webinar 30.4.2015
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data Architecture
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
 
Accelerating Big Data & Analytics Innovations through Public – Private Partne...
Accelerating Big Data & Analytics Innovations through Public – Private Partne...Accelerating Big Data & Analytics Innovations through Public – Private Partne...
Accelerating Big Data & Analytics Innovations through Public – Private Partne...
 
Easy SPARQLing for the Building Performance Professional
Easy SPARQLing for the Building Performance ProfessionalEasy SPARQLing for the Building Performance Professional
Easy SPARQLing for the Building Performance Professional
 
The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of Hadoop
 
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data ServicesMarlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 
Government GraphSummit: And Then There Were 15 Standards
Government GraphSummit: And Then There Were 15 StandardsGovernment GraphSummit: And Then There Were 15 Standards
Government GraphSummit: And Then There Were 15 Standards
 
Bigdata-Intro.pptx
Bigdata-Intro.pptxBigdata-Intro.pptx
Bigdata-Intro.pptx
 
Pr dc 2015 sql server is cheaper than open source
Pr dc 2015 sql server is cheaper than open sourcePr dc 2015 sql server is cheaper than open source
Pr dc 2015 sql server is cheaper than open source
 
May_Planning_20140528-Final3
May_Planning_20140528-Final3May_Planning_20140528-Final3
May_Planning_20140528-Final3
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
How to make your data count webinar, 26 Nov 2018
How to make your data count webinar, 26 Nov 2018How to make your data count webinar, 26 Nov 2018
How to make your data count webinar, 26 Nov 2018
 

Recently uploaded

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 

Recently uploaded (20)

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 

Crafting bigdatabenchmarks

  • 1. Tilmann Rabl Middleware Systems Research Group & bankmark UG ISC’14, June 26, 2014 Crafting Benchmarks for Big Data MIDDLEWARESYSTEMS RESEARCH GROUP MSRG.ORG
  • 2. Outline •Big Data Benchmarking Community •Our approach to building benchmarks •Big Data Benchmarks •Characteristics •BigBench •Big Decisions •Hammer •DAP •Slides borrowed from Chaitan Baru 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 2
  • 3. Big Data Benchmarking Community •Genesis of the Big Data Benchmarking effort •Grant from NSF under the Cluster Exploratory (CluE) program (Chaitan Baru, SDSC) •Chaitan Baru (SDSC), Tilmann Rabl (University of Toronto), Milind Bhandarkar (Pivotal/Greenplum), Raghu Nambiar (Cisco), Meikel Poess (Oracle) •Launched Workshops on Big Data Benchmarking •First WBDB: May 2012, San Jose. Hosted by Brocade •Objectives •Lay the ground for development of industry standards for measuring the effectiveness of hardware and software technologies dealing with big data •Exploit synergies between benchmarking efforts •Offer a forum for presenting and debating platforms, workloads, data sets and metrics relevant to big bata •Big Data Benchmark Community (BDBC) •Regular conference calls for talks and announcements •Open to anyone interested, free of charge •BDBC makes no claims to any developments or ideas •clds.ucsd.edu/bdbc/community Crafting Benchmarks 26.06.2014 for Big Data - Tilmann Rabl 3
  • 4. •Actian •AMD •BMMsoft •Brocade •CA Labs •Cisco •Cloudera •Convey Computer •CWI/Monet •Dell •EPFL •Facebook •Google •Greenplum •Hewlett-Packard •Hortonworks •Indiana Univ/ HathitrustResearch Foundation •InfoSizing •Intel •LinkedIn •MapR/Mahout •Mellanox •Microsoft •NSF •NetApp •NetApp/OpenSFS •Oracle •Red Hat •San Diego Supercomputer Center •SAS •Scripps Research Institute •Seagate •Shell •SNIA •Teradata Corporation •Twitter •UC Irvine •Univ. of Minnesota •Univ. of Toronto •Univ. of Washington •VMware •WhamCloud •Yahoo! 1st WBDB: Attendee Organizations Crafting Benchmarks f 26.06.2014 or Big Data - Tilmann Rabl 4
  • 5. Further Workshops 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 5 2ndWBDB: http://clds.sdsc.edu/wbdb2012.in 3rdWBDB: http://clds.sdsc.edu/wbdb2013.cn 4thWBDB: http://clds.sdsc.edu/wbdb2013.us 5thWBDB: http://clds.sdsc.edu/wbdb2014.de
  • 6. First Outcomes •Big Data Benchmarking Community (BDBC) mailing list (~200 members from ~80 organizations) •Organized webinars every other Thursday •http://clds.sdsc.edu/bdbc/community •Paper from First WBDB •Setting the Direction for Big Data Benchmark Standards C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, and T. Rabl, published in Selected Topics in Performance Evaluation and Benchmarking, Springer-Verlag 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 6
  • 7. Further Outcomes •Selected papers in Springer Verlag, Lecture Notes in Computer Science, Springer Verlag •LNCS 8163: Specifying Big Data Benchmarks (covering the first and second workshops) •LNCS 8585: Advancing Big Data Benchmarks (covering the third and fourth workshops, in print) •Papers from 5thWBDB will be in VolIII •Formation of TPC Subcommittee on Big Data Benchmarking •Working on TPCx-HS: TPC Express benchmark for Hadoop Systems, based on Terasort •http://www.tpc.org/tpcbd/ •Formation of a SPEC Research Group on Big Data Benchmarking •Proposal of BigDataTop100 List •Specification of BigBench 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 7
  • 8. TPC Big Data Subcommittee •TPCx-HS •TPC Express for Hadoop Systems •Based on Terasort •Teragen, Terasort, Teravalidate •Database size / Scale Factors •SF: 1, 3, 10, 30, 100, 300, 1000, 3000, 10000 TB •Corresponds to: 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B 100-byte records •Performance Metric •HSph@SF= SF/T (total elapsed time in hours) •Price/Performance •$/HSph, $ is 3-year total cost of ownership 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 8
  • 9. Formation of SPEC Research Big Data Working Group •Mission Statement ThemissionoftheBigData(BD)workinggroupistofacilitateresearchandtoengageindustryleadersfordefininganddevelopingperformancemethodologiesofbigdataapplications.Theterm‘‘bigdata’’hasbecomeamajorforceofinnovationacrossenterprisesofallsizes.Newplatforms,claimingtobethe“bigdata”platformwithincreasinglymorefeaturesformanagingbigdatasets,arebeingannouncedalmostonaweeklybasis.Yet,thereiscurrentlyalackofwhatconstitutesabigdatasystemandanymeansofcomparabilityamongsuchsystems. •Initial Committee Structure •Tilmann Rabl (Chair) •Chaitan Baru (Vice Chair) •Meikel Poess (Secretary) •To replace less formal BDBC group 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 9
  • 10. BigDataTop100 List •Modeled after Top500 and Graph500 in HPC community •Proposal presented at Strata Conference, February 2013 •Based on application-level benchmarking •Article in inaugural issue of the Big Data Journal •Big Data Benchmarking and the Big Data Top100 List by Baru, Bhandarkar, Nambiar, Poess, Rabl, Big Data Journal, Vol.1, No.1, 60-64, Anne LiebertPublications. •In progress 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 10
  • 11. Big Data Benchmarks 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 11
  • 12. Types of Big Data Benchmarks •Micro-benchmarks. To evaluate specific lower-level, system operations •E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al, OSU •Functional benchmarks. Specific high-level function. •E.g. Sorting: Terasort •E.g. Basic SQL: Individual SQL operations, e.g. Select, Project, Join, Order-By, … •Genre-specific benchmarks. Benchmarks related to type of data •E.g. Graph500. Breadth-first graph traversals •Application-level benchmarks •Measure system performance (hardware and software) for a given application scenario—with given data and workload 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 12
  • 13. Application-Level Benchmark Design Issues from WBDB •Audience: Who is the audience for the benchmark? •Marketing (Customers / End users) •Internal Use (Engineering) •Academic Use (Research and Development) •Is the benchmark for innovation or competition? •If a competitive benchmark is successful, it will be used for innovation •Application: What type of application should be modeled? •TPC: schema + transaction/query workload •BigData: Abstractions of a data processing pipeline, e.g. Internet-scale businesses 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 13
  • 14. App Level Issues -2 •Component vs. end-to-end benchmark. Is it possible to factor out a set of benchmark “components”, which can be isolated and plugged into an end-to- end benchmark? •The benchmark should consist of individual components that ultimately make up an end-to- end benchmark •Single benchmark specification: Is it possible to specify a single benchmark that captures characteristics of multiple applications ? •Maybe: Create a single, multi-step benchmark, with plausible end-to-end scenario 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 14
  • 15. App Level Issues -3 •Paper & Pencil vs. Implementation-based. Should the implementation be specification-driven or implementation-driven? •Start with an implementation and develop specification at the same time •Reuse. Can we reuse existing benchmarks? •Leverage existing work and built-up knowledgebase •Benchmark Data. Where do we get the data from? •Synthetic data generation: structured, non-structured data •Verifiability. Should there be a process for verification of results? •YES! 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 15
  • 16. Abstracting the Big Data World 1.Enterprise Data Warehouse + Other types of data •Structured enterprise data warehouse •Extend to incorporate semi-structured data, e.g. from weblogs, machine logs, clickstream, customer reviews, … •“Design time” schemas 2.Collection of heterogeneous data + Pipelines of processing •Enterprise data processing as a pipeline from data ingestion to transformation, extraction, subsetting, machine learning, predictive analytics •Data from multiple structured and non-structured sources •“Runtime” schemas –late binding, application-driven schemas 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 16
  • 17. Other Benchmarks discussed at WBDB •Big Decision, Jimmy Zhao, HP •HiBench/Hammer, LanYi, Intel •BigDataBench, JianfengZhan, Chinese Academy of Sciences •CloudSuite, OnurKocberber, EPFL •Genre specific benchmarks •Microbenchmarks 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 17
  • 18. The BigBenchProposal •End to end benchmark •Application level •Based on a product retailer (TPC-DS) •Focused on Parallel DBMS and MR engines •History •Launched at 1stWBDB, San Jose •Published at SIGMOD 2013 •Full spec at WBDB proceedings 2012 •Full kit at WBDB 2014 •Collaboration with Industry & Academia •First: Teradata, University of Toronto, Oracle, InfoSizing •Now: UofT, bankmark, Intel, Oracle, Microsoft, UCSD, Pivotal, Cloudera, InfoSizing, SAP, Hortonworks, Cisco, … 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 18
  • 19. Data Model 19 Unstructured Data Semi-Structured Data Structured Data Sales Customer Item Marketprice Web Page Web Log Reviews Adapted TPC-DS BigBench Specific 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl
  • 20. •Variety •Different schema parts •Volume •Based on scale factor •Similar to TPC-DS scaling, but continuous •Weblogs & product reviews also scaled •Velocity •Refreshes for all data •Different velocity for different areas •Vstructured< Vunstructured< Vsemistructured Data Model –3 Vs 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 20
  • 21. Workload •Workload Queries •30 “queries” •Specified in English (sort of) •No required syntax •Business functions (Adapted from McKinsey) •Marketing •Cross-selling, Customer micro-segmentation, Sentiment analysis, Enhancing multichannel consumer experiences •Merchandising •Assortment optimization, Pricing optimization •Operations •Performance transparency, Product return analysis •Supply chain •Inventory management •Reporting (customers and products) 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 21
  • 22. SQL-MR Query 1 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 22
  • 23. HiveQLQuery 1 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 23 SELECTpid1, pid2, COUNT (*) AS cnt FROM ( FROM ( FROM ( SELECT s.ss_ticket_numberAS oid, s.ss_item_skAS pid FROM store_saless INNER JOIN item iON s.ss_item_sk= i.i_item_sk WHERE i.i_category_idin (1 ,2 ,3) and s.ss_store_skin (10 , 20, 33, 40, 50) ) q01_temp_join MAP q01_temp_join.oid, q01_temp_join.pid USING 'cat' AS oid, pid CLUSTER BY oid ) q01_map_output REDUCE q01_map_output.oid, q01_map_output.pid USING 'java -cpbigbenchqueriesmr.jar:hive-contrib.jarde.bankmark.bigbench.queries.q01.Red' AS (pid1 BIGINT, pid2 BIGINT) ) q01_temp_basket GROUP BY pid1, pid2 HAVING COUNT (pid1) > 49 ORDER BY pid1, cnt, pid2;
  • 24. BigBench Current Status •All queries are available in Hive/Hadoop •New data generator (continuous scaling, realistic data) available •New metric available •Complete driver available •Refresh will be done soon •Full kit at WBDB 2014 •https://github.com/intel-hadoop/Big-Bench 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 24 Query Types Number of Queries Percentage Pure HiveQL 14 46% Mahout 5 17% OpenNLP 5 17% Custom MR 6 20%
  • 25. Big Decision, Jimmy Zhao, HP, 4thWBDB •Benchmark for A DSS/Data Mining solutions •Everything running in the same system •Engine of Analytics •Reflecting the real business model •Huge data volume •Data from Social •Data from Web log •Data from Comments •Broader Data support •Semi-structured data •Un-structured data •Continuous Data Integration •ETL just a normal job of the system •Data Integration whenever there’s data •Big Data Analytics Big Decision –Big TPC-DS! TPC-DS •Mature and proved workload for BI •Mix workloads •Well defined scale factors Semi + unstructured TPC-DS •Additional data and dimension from new data •Semi-structured and unstructured data •TB to PB or even Zeta Byte support NEW TPC-DS generator –Agile ETL •Continuously data generation and injection •Consider as part of the workloads New massive parallel processing technologies •Convert queries to SQL liked queries •Include interactive & regular Queries •Include Machine Learning jobs 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 25
  • 26. Agile ETL Marketing Big Decision Block Diagram SNS Marketing TPC-DS Web page Sales Web log Item Reviews Social Message Search & Social Advertise Search Social Advertise Social Web pages Extraction Transform Load Customer Social Feedbacks Mobile log Mobile log 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 26
  • 27. HiBench, LanYi, Intel, 4thWBDB 26.06.2014 HiBench –Enhanced DFSIO Micro Benchmarks Web Search –Sort –WordCount –TeraSort –Nutch Indexing –Page Rank Machine Learning –Bayesian Classification –K-Means Clustering HDFSSee our paper “The HiBenchSuite: Characterization of the MapReduce-Based Data Analysis” in ICDE’10 workshops (WISS’10) 1.Different from GridMix, SWIM? 2.Micro Benchmark? 3.Isolated components? 4.End-2-end Benchmark? 5.We need ETL-Recommendation Pipeline Crafting Benchmarks for Big Data - Tilmann Rabl 27
  • 28. ETL-Recommendation (hammer) 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 28
  • 29. ETL-Recommendation (hammer) •Task Dependences Pref-logs ETL-logs Pref-sales Item based Collaborative Filtering Pref-comb ETL-sales Offline test 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 29
  • 30. The Deep Analytics Pipeline, Bhandarkar (1stWBDB) •“User Modeling” pipelines •Generic use case: Determine user interests or user categories by mining user activities •Large dimensionality of possible user activities •Typical user represents a sparse activity vector •Event attributes change over time 30 Data Acquisition/ Normalization/ Sessionization Feature andTarget Generation Model Training Offline Scoring & Evaluation Batch Scoring & Upload toServer Acquisition/ Recording Extraction/ Cleaning/ Annotation Integration/ Aggregation/ Representation Analysis/ Modeling Interpretation 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl
  • 31. Example Application Domains •Retail •Events: clicks on purchases, ad clicks, FB likes, … •Goal: Personalized product recommendations •Datacenters •Events: log messages, traffic, communications events, … •Goal: Predict imminent failures •Healthcare •Events: Doctor visits, medical history, medicine refills, … •Goal: Prevent hospital readmissions •Telecom •Events: Calls made, duration, calls dropped, location, social graph, … •Goal: Reduce customer churn •Web Ads •Events: Clicks on content, likes, reposts, search queries, comments, … •Goal: Increase engagement, increase clicks on revenue-generation content 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 31
  • 32. Steps in the Pipeline •Acquisition and normalization of data •Collate, consolidate data •Join targets and features •Construct targets; filter out user activity without targets; join feature vector with targets •Model Training •Multi-model: regressions, Naïve Bayes, decision trees, Support Vector Machines, … •Offline scoring •Score features, evaluate metrics •Batch scoring •Apply models to all user activity; upload scores 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 32
  • 33. Application Classes •Widely varying number of events per entity •Multiple classes of applications, based on size, e.g.: •Tiny (100K entities, 10 events per entity) •Small (1M entities, 10 events per entity) •Medium (10M entities, 100 events per entity) •Large (100M entities, 1000 events per entity) •Huge (1B entities, 1000 events per entity) 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 33
  • 34. Proposal for Pipeline Benchmark Results •Publish results for every stage in the pipeline •Data pipelines for different application domains may be constructed by mix and match of various pipeline stages •Different modeling techniques per class •So, need to publish performance numbers for every stage 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 34
  • 35. Get involved •Workshop on Big Data Benchmarking (WBDB) •Fifth workshop: August 6-7, Potsdam, Germany •clds.ucsd.edu/wbdb2014.de •Proceedings will be published in Springer LNCS •Big Data Benchmarking Community •Biweekly conference calls (sort of) •Mailing list •clds.ucsd.edu/bdbc/community •Coming up next: BDBC@SPEC Research •We will join forces with SPEC Research •Try BigBench: •https://github.com/intel-hadoop/Big-Bench 26.06.2014 Crafting Benchmarks for Big Data - Tilmann Rabl 35
  • 36. Questions? Crafting Benchmarks for Big Data - Tilmann Rabl 36 Contact: Tilmann Rabl tilmann.rabl@bankmark.de tilmann.rabl@utoronto.ca 26.06.2014 MIDDLEWARESYSTEMS RESEARCH GROUP MSRG.ORG Thank You!