@BDOOP_BCN
Benchmarking Hadoop
by Nicolas Poggi @ni_po
June 2, 2015
About Nicolas Poggi @ni_po
What is BDOOP about?
● A group to share on Data
● Scalability
● Performance
● Configurations
● Cluster design
● Benchmarking
● …a/couple of beer/s!
• Having sysadmins in mind
● Also POs
● Not a group to learn
• Java
• Mapreduce programming
• Hadoop base concepts
BDOOP Group Objectives
● Create a local community to
● Learn Big Data
● performance and scalability
● Share
● day-to-day problems and solutions
● Present your work and findings
● Have talks from renowned experts
● > Your objective here <
Benchmarking Motivation and Intro
Hadoop design
 Hadoop is designed to process complex data
 Structured and unstructured
 With [close to] linear scalability
 Simplifying the programming model
 From MPI, OpenMP, CUDA, …
 Operates as a blackbox for data analysts
Image source: Hadoop, the definitive guide
Hadoop attributes
 Fault tolerant
 on commodity hardware
 Built in redundancy
 via replication
 Automatically scales out / down
 With [almost] linear scalability
 Move computation to data
 minimize communication
 Shared-nothing architecture
Hadoop highly-scalable but…
 Not a high-performance solution!
 Requires
 Design,
 Clusters, cluster topology
 Setup,
 OS, Hadoop config
 and tuning required
 Iterative approach
 Time consuming
 And extensive benchmarking!
Hadoop parameters
 > 100+ tunable parameters
 mapred.map/reduce.tasks.speculative.execution
 obscure and interrelated
 io.sort.mb 100 (300)
 io.sort.record.percent 5% (15%)
 io.sort.spill.percent 80% (95 – 100%)
 Number of Mappers and Reducers
 Rule of thumb 0.5 - 2 per CPU core
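The rule of thumb above can be turned into a starting range for concurrent task slots per node. This is a minimal sketch: the function name and the candidate table are hypothetical, and reading the slide's value pairs as "default (tuned)" is an assumption. The numbers are starting points to benchmark, not recommendations.

```python
import math
from typing import Dict, Tuple


def task_slot_range(cpu_cores: int) -> Tuple[int, int]:
    """Return (low, high) task slots per node using the 0.5 - 2 per core rule of thumb."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    low = max(1, math.ceil(0.5 * cpu_cores))
    high = 2 * cpu_cores
    return low, high


# Candidate values copied from the slide, read here as (default, tuned).
# That reading is an assumption; verify against your Hadoop version.
SORT_BUFFER_CANDIDATES: Dict[str, Tuple[float, float]] = {
    "io.sort.mb": (100, 300),
    "io.sort.record.percent": (0.05, 0.15),
    "io.sort.spill.percent": (0.80, 0.95),
}

if __name__ == "__main__":
    low, high = task_slot_range(8)
    print(f"8 cores: try between {low} and {high} concurrent tasks per node")
    for name, (default, tuned) in SORT_BUFFER_CANDIDATES.items():
        print(f"{name}: default={default}, candidate={tuned}")
```

Each candidate still needs its own benchmark run, since the parameters interact.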
Hadoop ecosystem
 Large and spread
 Dominated by big players
 Custom patches
 Default values not ideal
 Product claims
 Cloud vs. On-premise
 IaaS
 PaaS
 EMR, HDInsight
 Needs standardization and auditing!
DATA
Product claims
 Need auditing!
Workload (jobs)
 All jobs are different!
 Different requirements
 CPU bound
 Memory bound
 I/O bound
 … a bit of all
 Different tuning for each
 Needs benchmarking!
Figure: sample mappers and reducers for 3 popular benchmarks (Terasort, K-means, Wordcount)
One for all config?
Is there one software configuration that fits everybody?
Figure: performance of each configuration per workload. Vertical line: average performance for this workload across configurations. Values to the right: above average. Values to the left: below average.
Annotations: some configurations are good for Terasort but bad for Wordcount; others are good for Wordcount but very bad for Terasort.
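The comparison against the per-workload average can be sketched as follows. The configuration names and execution times are made up for illustration only and are not results from the talk.

```python
from statistics import mean
from typing import Dict


def relative_to_average(exec_times_s: Dict[str, float]) -> Dict[str, float]:
    """Map each configuration to the percent it is faster (+) or slower (-) than the mean time."""
    avg = mean(exec_times_s.values())
    return {cfg: round(100.0 * (avg - t) / avg, 1) for cfg, t in exec_times_s.items()}


# Illustrative, made-up execution times in seconds (not measurements).
terasort = {"config_a": 800.0, "config_b": 1000.0, "config_c": 1200.0}
wordcount = {"config_a": 1500.0, "config_b": 1000.0, "config_c": 500.0}

if __name__ == "__main__":
    print("terasort ", relative_to_average(terasort))
    print("wordcount", relative_to_average(wordcount))
```

In this toy data, config_a sits above average for Terasort and well below average for Wordcount, which is the pattern the figure annotations describe.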
Example of SSD impact on execution time
 Impact of SSDs on the running time of Terasort
Figure: Terasort execution time per configuration, SSDs vs. HDDs (SATA)
Too many choices?
Figure: configuration options rated from - to + along several axes. Labels: Remote volumes, Rotational HDDs, JBODs, RAID, Large VMs, Small VMs, Gb Ethernet, InfiniBand, Cost, Performance, On-Premise, Cloud, High availability, Replication.
And where is my system configuration positioned on each of these axes?
Benchmarks
Why benchmark?
 Validate assumptions
 Reproduce bad behavior
 Debugging
 Measure performance and scale
 Simulate higher load
 Find bottlenecks / limits
 Plan for growth
 Test different SW and HW
Source: Based on High Performance MySQL, benchmarking MySQL chapter
Benchmarking stakeholders and use cases
 End-user / consumer
 Compare products
 Developer
 Profiling
 CI / QA
 Sysadmin / architect
 Cluster sizing
 SW and HW vendors
 Product claims
 Marketing
 Researcher
 …
Big Data Vs
 Volume
 Velocity
 Variety
 Structured, semi, unstructured data
 Different types of data (genres)
 Veracity
 Value
Sample scale factor from TPCx-HS
Data generation
 Real vs. Synthetic
 Random data vs. repeatable
 Data generation time
 Parallel
 Data distribution
 Flat or uniformly distributed
 Gaussian (normal distribution, skew)
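Repeatable synthetic data with a chosen distribution can be sketched with a seeded generator. The function name, the default seed, and the Gaussian parameters below are illustrative assumptions, not taken from any benchmark kit.

```python
import random
from typing import List


def generate_keys(n: int, distribution: str = "uniform", seed: int = 42) -> List[float]:
    """Generate n synthetic keys; the same seed always yields the same data."""
    rng = random.Random(seed)  # private generator, independent of global state
    if distribution == "uniform":
        return [rng.uniform(0.0, 1.0) for _ in range(n)]
    if distribution == "gaussian":
        # Mean 0.5 and standard deviation 0.15 are arbitrary illustrative choices.
        return [rng.gauss(0.5, 0.15) for _ in range(n)]
    raise ValueError(f"unknown distribution: {distribution}")


if __name__ == "__main__":
    first = generate_keys(5, "gaussian", seed=7)
    second = generate_keys(5, "gaussian", seed=7)
    print(first == second)  # repeatable across runs with the same seed
```

Fixing the seed keeps runs comparable; changing it gives a fresh but still reproducible data set.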
Issues Benchmarking Big Data
 Big Scale
 Single node vs Multiple nodes
 10MB vs 10TB
 On-metal vs. virtualized vs. cloud
 Non-determinism / randomness
 Need to average multiple runs
 How long to benchmark
 System warm-up
 Distributed systems
 Failures?
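Averaging multiple runs while discarding warm-up runs can be sketched as below. The function name and the choice of one warm-up run are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_runs(times_s: Sequence[float], warmup_runs: int = 1) -> Dict[str, float]:
    """Drop the first warmup_runs measurements, then report mean, stdev and variation."""
    measured = list(times_s)[warmup_runs:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs after warm-up")
    avg = mean(measured)
    sd = stdev(measured)  # sample standard deviation
    return {
        "runs": len(measured),
        "mean_s": avg,
        "stdev_s": sd,
        "cv_pct": 100.0 * sd / avg,  # coefficient of variation, in percent
    }


if __name__ == "__main__":
    # Illustrative timings in seconds; the first run includes warm-up cost.
    print(summarize_runs([500, 100, 110, 90], warmup_runs=1))
```

A high coefficient of variation suggests that more runs are needed before comparing configurations.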
Types of benchmarks and Standards
 Micro benchmarks
 HDFSIO
 Functional
 Terasort, ETL
 Genre-specific
 Graph 500
 Application level
 BigBench
 TPC (implementation) vs SPEC (reference)
TPC vs. SPEC models
TPC
 Specification based
 Performance, price, energy in one benchmark
 End-to-end
 Multiple tests (ACID, load)
 Independent review
 Full disclosure
 TPC Technology Conference
SPEC
 Kit based
 Performance and energy in separate benchmarks
 Server-centric
 Single test
 Peer review
 Summary disclosure
 SPEC Research Group, ICPE
Source: From presentation by Meikel Poess, 1st WBDB, May 2012
Data Benchmarks
Classical SQL OLAP DB Big Data
 First there was TPC-H
 Classical SQL OLAP benchmark
 MRBench for M/R
 On top of Hive or Impala for Hadoop
 Then sorting
 Terasort
 Unofficial standard
 Now part of TPCx-HS
 Hadoop samples
 Wordcount, grep, terasort, DFSIO
 YCSB
 From Yahoo!
 For NoSQL, HBase implementation
 GridMix
 CALDA
 HiBench
 SWIM
 BigBench
 based on TPC-DS + ML
 30 queries
 BigDataBench
 33 workloads
 TPCx-HS
Comparison of popular Hadoop benchmarks
Benchmark | Spec[1] | App domains | Workload types | Workloads | Scalable data sets[2] | Diverse implem[3] | Multi-tenancy[4] | Subset[5] | Simulator[6]
BigDataBench | Y | Five | Four[7] | Thirty-three[8] | Eight[9] | Y | Y | Y | Y
BigBench | Y | One | Three | Ten | Three | N | N | N | N
CloudSuite | N | N/A | Two | Eight | Three | N | N | N | Y
HiBench | N | N/A | Two | Ten | Three | N | N | N | N
CALDA | Y | N/A | One | Five | N/A | Y | N | N | N
YCSB | Y | N/A | One | Six | N/A | Y | N | N | N
LinkBench | Y | N/A | One | Ten | N/A | Y | N | N | N
AMP Benchmarks | Y | N/A | One | Four | N/A | Y | N | N | N
The differences of BigDataBench from other benchmark suites.
Source: BigDataBench homepage
What to measure and metrics
 Job execution time
 Throughput
 Units / time
 Framework overhead
 # of spills
 Scalability
 Concurrency
 Abstract metrics
 CPU
 MEM
 DISK
 IOPS, latency, bandwidth
 NET
 Latency, bandwidth
 TPCx-HS performance metric (HSph@SF)
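Two of the metrics above reduce to simple ratios. The sketch below assumes the TPCx-HS performance metric is the scale factor divided by the total elapsed time expressed in hours; check the current TPCx-HS specification before relying on it. Function names are hypothetical.

```python
def throughput(units: float, elapsed_s: float) -> float:
    """Units of work (records, bytes, jobs) processed per second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return units / elapsed_s


def hsph_at_sf(scale_factor: float, total_elapsed_s: float) -> float:
    """HSph@SF, assumed here to be SF / (T / 3600) with T in seconds."""
    if total_elapsed_s <= 0:
        raise ValueError("total_elapsed_s must be positive")
    return scale_factor / (total_elapsed_s / 3600.0)


if __name__ == "__main__":
    # Illustrative values only: scale factor 10 finished in two hours.
    print(hsph_at_sf(10, 7200))
    print(throughput(100, 50))
```

Because the scale factor appears in the numerator, results are only comparable at the same scale factor.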
Benchmarking
Project ALOJA online repository
 Entry point to explore the results collected from the executions
 Provides insights on the obtained results through continuously evolving data views
 Online results at: http://hadoop.bsc.es
ALOJA Platform: Evolution and status
 Benchmarking, Repository, and Analytics tools for Big Data
 Composed of open-source
 Benchmarking, provisioning and orchestration tools,
 high-level system performance metric collection,
 low-level Hadoop instrumentation based on BSC Tools
 and Web based data analytics tools
 And recommendations
 Online Big Data Benchmark repository of:
 20,000+ runs (from HiBench)
 Sharable, comparable, repeatable, verifiable executions
 Abstracting and leveraging tools for BD benchmarking
 Not reinventing the wheel but,
 most current BD tools designed for production, not for benchmarking
 leverages current compatible tools and projects
 Dev VM toolset and sandbox
 via Vagrant
Figure: Big Data Benchmarking, Online Repository, Analytics
Workflow in ALOJA
Cluster(s) definition
• VM sizes
• # nodes
• OS, disks
• Capabilities
Execution plan
• Start cluster
• Exec benchmarks
• Gather results
• Cleanup
Import data
• Convert perf metrics
• Parse logs
• Import into DB
Evaluate data
• Data views in Vagrant VM
• Or http://hadoop.bsc.es
PA and KD
• Predictive Analytics
• Knowledge Discovery
Historic Repo
Benchmarks Execution comparisons
 You can compare, side by side, all execution parameters:
 CPU, Memory, Network, Disk, Hadoop parameters….
Sample: http://hadoop.bsc.es/perfcharts?execs[]=91144
HiBench suite
HiBench: A Benchmark Suite for Hadoop
A Comprehensive & Realistic Benchmark Suite
 Micro Benchmarks: Sort, WordCount, TeraSort
 Web Search: Nutch Indexing, Page Rank
 Machine Learning: Bayesian Classification, K-Means Clustering
 HDFS: Enhanced DFSIO
Code at: https://github.com/intel-hadoop/HiBench
Job resource requirements 1/2
Source: Intel HiBench
Job resource requirements 2/2
Source: Intel HiBench
Impact of SW configurations on speedup
Figure, two panels: number of mappers (4m, 6m, 8m, 10m) and compression algorithm (no compression, ZLIB, BZIP2, snappy). Axis: speedup (higher is better).
Results using: http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
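Speedup in charts like these is commonly the baseline execution time divided by the time of the configuration under test. The sketch below assumes that definition; the configuration names and timings are invented for illustration and are not ALOJA results.

```python
from typing import Dict, List, Tuple


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Speedup > 1 means the candidate is faster than the baseline."""
    if candidate_s <= 0:
        raise ValueError("candidate_s must be positive")
    return baseline_s / candidate_s


def rank_by_speedup(times_s: Dict[str, float], baseline: str) -> List[Tuple[str, float]]:
    """Return (configuration, speedup) pairs, best first, relative to the named baseline."""
    base = times_s[baseline]
    ranked = [(cfg, speedup(base, t)) for cfg, t in times_s.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative execution times in seconds per compression setting.
    times = {"none": 1000.0, "zlib": 800.0, "snappy": 500.0}
    for cfg, s in rank_by_speedup(times, baseline="none"):
        print(f"{cfg}: {s:.2f}x")
```

The baseline should be fixed before the runs, so that every configuration is compared against the same reference.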
Impact of HW configurations on speedup
Figure, two panels: disks and network (HDD-ETH, HDD-IB, SSD-ETH, SSD-IB) and cloud remote volumes (local only; 1, 2 or 3 remotes; 1, 2 or 3 remotes with /tmp local). Axis: speedup (higher is better).
Results using: http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
Cost/Performance Scalability
 Terasort (100GB)
Sample from: http://hadoop.bsc.es/nodeseval
Figure: execution time and execution cost per configuration:
 InfiniBand + SSD (LOCAL)
 GbE + SSD (LOCAL)
 CLOUD (/tmp and HDFS on local disk)
 CLOUD (/tmp on local disk, HDFS in Blob storage, 1-3 devices)
 CLOUD (/tmp and HDFS in Blob storage, 1-3 devices)
 InfiniBand + SATA disks (LOCAL)
 GbE + SATA disks (LOCAL)
Price/performance: cost-effectiveness (on-premise vs. cloud)
Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
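Execution cost can be approximated as execution time multiplied by the hourly price of the cluster. This is a simplified sketch under that assumption (it ignores setup time, storage and minimum billing periods); the cluster names and prices are hypothetical.

```python
from typing import Dict, Tuple


def execution_cost(exec_time_s: float, cluster_price_per_hour: float) -> float:
    """Cost of one run: hours used times the hourly price of the whole cluster."""
    return exec_time_s / 3600.0 * cluster_price_per_hour


def most_cost_effective(runs: Dict[str, Tuple[float, float]]) -> str:
    """runs maps a name to (exec_time_s, price_per_hour); return the cheapest run."""
    return min(runs, key=lambda name: execution_cost(*runs[name]))


if __name__ == "__main__":
    # Illustrative numbers only: a fast expensive cluster vs. a slow cheap one.
    runs = {"on_prem_ssd": (900.0, 12.0), "cloud_remote": (1800.0, 4.0)}
    for name, (time_s, price) in runs.items():
        print(f"{name}: {time_s:.0f} s, cost {execution_cost(time_s, price):.2f}")
    print("cheapest:", most_cost_effective(runs))
```

In this toy example the faster cluster is not the cheaper one, which is why time and cost are worth plotting side by side.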
Common Benchmarking pitfalls
 Scalability
 Assuming near-linear scalability
 Compare apples to apples
 if benchmarking HW, change the HW but leave the SW the same
 Terasort in v1 != Terasort in v2
 To test for Big Data, use large data
 stress the system
 If results look too good to be true, they probably are
 Don’t believe in miracles
 Expect vendor lies
Source: adapted from Benchmarking Big Data Systems by YANPEI CHEN and GWEN SHAPIRA at Big Data Spain
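The "apples to apples" rule can be checked mechanically by comparing the recorded settings of two runs and confirming that only the dimension under test changed. The setting names below are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def changed_settings(run_a: Dict[str, str], run_b: Dict[str, str]) -> List[str]:
    """Return the sorted names of settings whose values differ between two runs."""
    keys = set(run_a) | set(run_b)
    return sorted(k for k in keys if run_a.get(k) != run_b.get(k))


def is_fair_comparison(run_a: Dict[str, str], run_b: Dict[str, str]) -> bool:
    """A comparison is treated as fair here when exactly one setting differs."""
    return len(changed_settings(run_a, run_b)) == 1


if __name__ == "__main__":
    # Hypothetical run descriptions.
    hdd = {"disk": "HDD", "network": "ETH", "hadoop": "v1", "benchmark": "terasort"}
    ssd = {"disk": "SSD", "network": "ETH", "hadoop": "v1", "benchmark": "terasort"}
    ssd_v2 = {"disk": "SSD", "network": "ETH", "hadoop": "v2", "benchmark": "terasort"}
    print(changed_settings(hdd, ssd), is_fair_comparison(hdd, ssd))
    print(changed_settings(hdd, ssd_v2), is_fair_comparison(hdd, ssd_v2))
```

The second pair changes both hardware and software version, so any difference in time cannot be attributed to the disks alone.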
Resources
 ALOJA Benchmarking platform and online repository
 http://hadoop.bsc.es/
 Big Data Benchmarking Community (BDBC) mailing list
 (~200 members from ~80 organizations)
 http://clds.sdsc.edu/bdbc/community
 Workshop Big Data Benchmarking (WBDB)
 Next: http://clds.sdsc.edu/wbdb2015.ca
 SPEC Research Big Data working group
 http://research.spec.org/working-groups/big-data-working-group.html
 Slides and video:
 Michael Frank on Big Data benchmarking
 http://www.tele-task.de/archive/podcast/20430/
 Tilmann Rabl Big Data Benchmarking Tutorial
 http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl

More Related Content

What's hot

The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudNicolas Poggi
 
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDKBig Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDKPrincipled Technologies
 
Introducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big DataIntroducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big Datainside-BigData.com
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Nicolas Poggi
 
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Sumeet Singh
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)Nicolas Poggi
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04Ted Dunning
 

What's hot (20)

The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDKBig Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs. Oracle JDK
 
Introducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big DataIntroducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big Data
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
 
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04
 

Viewers also liked

TestDFSIO
TestDFSIOTestDFSIO
TestDFSIOhhyin
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Keeping Pressure Vessels Safe with the Sharck™  Probe
Keeping Pressure Vessels Safe with the Sharck™  ProbeKeeping Pressure Vessels Safe with the Sharck™  Probe
Keeping Pressure Vessels Safe with the Sharck™  ProbeEddyfi
 
Assigment 6
Assigment 6Assigment 6
Assigment 6fuzuli41
 
что такое Smm в 2013 году на примере
что такое Smm в 2013 году на примеречто такое Smm в 2013 году на примере
что такое Smm в 2013 году на примереАнтон Чернятин
 
Engineering Mechanics Statics design problem # 5.4 concrete chut by Kehali...
Engineering Mechanics Statics  design problem  # 5.4  concrete chut by Kehali...Engineering Mechanics Statics  design problem  # 5.4  concrete chut by Kehali...
Engineering Mechanics Statics design problem # 5.4 concrete chut by Kehali...kehali Haileselassie
 
Bipolar junction transistor characterstics biassing and amplification, lab 9
Bipolar junction transistor characterstics biassing and amplification, lab 9Bipolar junction transistor characterstics biassing and amplification, lab 9
Bipolar junction transistor characterstics biassing and amplification, lab 9kehali Haileselassie
 
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array ProbeInspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array ProbeEddyfi
 
The avanti group sharp turn for electronics company
The avanti group sharp turn for electronics companyThe avanti group sharp turn for electronics company
The avanti group sharp turn for electronics companyApplecherr McDougal
 
Bahasa indonesia teks laporan hasil observasi
Bahasa indonesia teks laporan hasil observasiBahasa indonesia teks laporan hasil observasi
Bahasa indonesia teks laporan hasil observasiSri Utanti
 
High-Speed Remote-Field Testing in Carbon Steel Tubing
High-Speed Remote-Field Testing in Carbon Steel TubingHigh-Speed Remote-Field Testing in Carbon Steel Tubing
High-Speed Remote-Field Testing in Carbon Steel TubingEddyfi
 
Texture powerpoint final
Texture powerpoint finalTexture powerpoint final
Texture powerpoint finalkphan22
 
Inspecting In-Service Storage Tank Annular Rings for Corrosion
Inspecting In-Service Storage Tank Annular Rings for CorrosionInspecting In-Service Storage Tank Annular Rings for Corrosion
Inspecting In-Service Storage Tank Annular Rings for CorrosionEddyfi
 
MASALAH EKONOMI
MASALAH EKONOMIMASALAH EKONOMI
MASALAH EKONOMISri Utanti
 
Defect Detection & Prevention in Cast Turbine Wheels
Defect Detection & Prevention in Cast Turbine WheelsDefect Detection & Prevention in Cast Turbine Wheels
Defect Detection & Prevention in Cast Turbine WheelsEddyfi
 
JLL JF 100 Excercise Bike Manual
JLL JF 100 Excercise Bike ManualJLL JF 100 Excercise Bike Manual
JLL JF 100 Excercise Bike ManualJLL Fitness
 
JLL Electronics Treadmills Magzine
JLL Electronics Treadmills MagzineJLL Electronics Treadmills Magzine
JLL Electronics Treadmills MagzineJLL Fitness
 

Viewers also liked (19)

TeraSort
TeraSortTeraSort
TeraSort
 
TestDFSIO
TestDFSIOTestDFSIO
TestDFSIO
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Keeping Pressure Vessels Safe with the Sharck™  Probe
Keeping Pressure Vessels Safe with the Sharck™  ProbeKeeping Pressure Vessels Safe with the Sharck™  Probe
Keeping Pressure Vessels Safe with the Sharck™  Probe
 
Assigment 6
Assigment 6Assigment 6
Assigment 6
 
что такое Smm в 2013 году на примере
что такое Smm в 2013 году на примеречто такое Smm в 2013 году на примере
что такое Smm в 2013 году на примере
 
Engineering Mechanics Statics design problem # 5.4 concrete chut by Kehali...
Engineering Mechanics Statics  design problem  # 5.4  concrete chut by Kehali...Engineering Mechanics Statics  design problem  # 5.4  concrete chut by Kehali...
Engineering Mechanics Statics design problem # 5.4 concrete chut by Kehali...
 
Bipolar junction transistor characterstics biassing and amplification, lab 9
Bipolar junction transistor characterstics biassing and amplification, lab 9Bipolar junction transistor characterstics biassing and amplification, lab 9
Bipolar junction transistor characterstics biassing and amplification, lab 9
 
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array ProbeInspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
Inspection of Stainless Steel Heat Exchanger Tubes with Eddy Current Array Probe
 
The avanti group sharp turn for electronics company
The avanti group sharp turn for electronics companyThe avanti group sharp turn for electronics company
The avanti group sharp turn for electronics company
 
Bahasa indonesia teks laporan hasil observasi
Bahasa indonesia teks laporan hasil observasiBahasa indonesia teks laporan hasil observasi
Bahasa indonesia teks laporan hasil observasi
 
High-Speed Remote-Field Testing in Carbon Steel Tubing
High-Speed Remote-Field Testing in Carbon Steel TubingHigh-Speed Remote-Field Testing in Carbon Steel Tubing
High-Speed Remote-Field Testing in Carbon Steel Tubing
 
Texture powerpoint final
Texture powerpoint finalTexture powerpoint final
Texture powerpoint final
 
Inspecting In-Service Storage Tank Annular Rings for Corrosion
Inspecting In-Service Storage Tank Annular Rings for CorrosionInspecting In-Service Storage Tank Annular Rings for Corrosion
Inspecting In-Service Storage Tank Annular Rings for Corrosion
 
OBA.BY
OBA.BYOBA.BY
OBA.BY
 
MASALAH EKONOMI
MASALAH EKONOMIMASALAH EKONOMI
MASALAH EKONOMI
 
Defect Detection & Prevention in Cast Turbine Wheels
Defect Detection & Prevention in Cast Turbine WheelsDefect Detection & Prevention in Cast Turbine Wheels
Defect Detection & Prevention in Cast Turbine Wheels
 
JLL JF 100 Excercise Bike Manual
JLL JF 100 Excercise Bike ManualJLL JF 100 Excercise Bike Manual
JLL JF 100 Excercise Bike Manual
 
JLL Electronics Treadmills Magzine
JLL Electronics Treadmills MagzineJLL Electronics Treadmills Magzine
JLL Electronics Treadmills Magzine
 

Similar to Benchmarking Hadoop and Big Data

LAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLinaro
 
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016Ganesh Raju
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...DevOps.com
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsDirecti Group
 
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar SlidesHortonworks
 
Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Amazon Web Services
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8MongoDB
 
Bodo Value Guide.pdf
Bodo Value Guide.pdfBodo Value Guide.pdf
Bodo Value Guide.pdfGregHanchin1
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BIPrasad Prabhu (PP)
 
1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part isqlserver.co.il
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Community
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics PlatformSantanu Dey
 

Similar to Benchmarking Hadoop and Big Data (20)

LAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96Boards
 
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
How the Automation of a Benchmark Famework Keeps Pace with the Dev Cycle at I...
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...



Benchmarking Hadoop and Big Data

  • 1. @BDOOP_BCN Benchmarking Hadoop by Nicolas Poggi @ni_po June 2, 2015
  • 3. What is BDOOP about? ● A group to share on Data ● Scalability ● Performance ● Configurations ● Cluster design ● Benchmarking ● …a/couple of beer/s! • Having sysadmins in mind ● Also POs ● Not a group to learn • Java • MapReduce programming • Hadoop base concepts
  • 4. BDOOP Group Objectives ● Create a local community to ● Learn Big Data ● performance and scalability ● Share ● day-to-day problems and solutions ● Present your work and findings ● Have talks from renowned experts ● > Your objective here <
  • 6. Hadoop design  Hadoop designed to solve complex data problems  Structured and unstructured  With [close to] linear scalability  Simplifying the programming model  From MPI, OpenMP, CUDA, …  Operates as a black box for data analysts Image source: Hadoop, the definitive guide
  • 7. Hadoop attributes  Fault tolerant  from commodity hardware  Built-in redundancy  via replication  Automatically scales out / down  With [almost] linear scalability  Move computation to data  minimize communication  Shared-nothing architecture
  • 8. Hadoop highly-scalable but…  Not a high-performance solution!  Requires  Design,  Clusters, topology clusters  Setup,  OS, Hadoop config  and tuning required  Iterative approach  Time consuming  And extensive benchmarking!
  • 9. Hadoop parameters  > 100+ tunable parameters  mapred.map/reduce.tasks.speculative.execution  obscure and interrelated  io.sort.mb 100 (300)  io.sort.record.percent 5% (15%)  io.sort.spill.percent 80% (95 – 100%)  Number of Mappers and Reducers  Rule of thumb 0.5 - 2 per CPU core
  • 10. Hadoop ecosystem  Large and spread  Dominated by big players  Custom patches  Default values not ideal  Product claims  Cloud vs. On-premise  IaaS  PaaS  EMR, HDInsight  Needs standardization and auditing! DATA
  • 12. Workload (jobs)  All jobs are different!  Different requirements  CPU bound  Memory bound  I/O bound  … a bit of all  Different tuning for each  Needs benchmarking! Terasort K-means Wordcount Sample mappers and reducer for 3 popular benchmarks:
  • 13. One for all config? Vertical line: Average performance for this workload across configurations Values to the right: above average Values to the left: below average Is there one software configuration iteration that fits everybody? Configurations Good for Terasort but bad for Wordcount Good for Terasort but bad for Wordcount Good for Wordcount but very bad for Terasort
  • 14. Example of SSD impact to Execution time  Impact of SSDs to running time of Terasort SSDs HDDs Configurations SSD SATA
  • 15. Too many choices? Remote volumes - - Rotational HDDs JBODs Large VMs Small VMs GbEthernet InfiniBand RAID Cost Performance On-Premise Cloud And where is my system configuration positioned on each of these axes? High availability Replication + +
  • 17. Why benchmark?  Validate assumptions  Reproduce bad behavior  Debugging  Measure performance and scale  Simulate higher load  Find bottlenecks/ limits  Plan for growth  Test different  SW and HW Source: Based on High Performance MySQL, benchmarking MySQL chapter
  • 18. Benchmarking stakeholders and use cases  End-user / consumer  Compare products  Developer  Profiling  CI / QA  Sysadmin / architect  Cluster sizing  SW and HW vendors  Product claims  Marketing  Researcher  …
  • 19. Big Data Vs  Volume  Velocity  Variety  Structured, semi, unstructured data  Different types of data (genres)  Veracity  Value Sample scale factor from TPCx-HS
  • 20. Data generation  Real vs. Synthetic  Random data vs. repeatable  Data generation time  Parallel  Data distribution  Flat or uniformly distributed  Gaussian (normal distribution, skew)
  • 21. Issues Benchmarking Big Data  Big Scale  Single node vs Multiple nodes  10MB vs 10TB  On-metal vs. virtualized vs. cloud  Non-deterministic / Randomness  Need to average multiple runs  How long to benchmark  System warm-up  Distributed systems  Failures?
  • 22. Types of benchmarks and Standards  Micro benchmarks  HDFSIO  Functional  Terasort, ETL  Genre-specific  Graph 500  Application level  BigBench  TPC (implementation) vs SPEC (reference)
  • 23. TPC vs. SPEC models  Specification based  Performance, price, energy in one benchmark  End-to-end  Multiple tests (ACID, load)  Independent review  Full disclosure  TPC Technology Conference  Kit based  Performance and energy in separate benchmarks  Server-centric  Single test  Peer review  Summary disclosure  SPEC Research Group, ICPE Source: From presentation by Meikel Poess, 1st WBDB, May 2012
  • 24. Data Benchmarks Classical SQL OLAP DB Big Data  First there was TPC-H  Classical SQL OLAP benchmark  MRBench for M/R  On top of Hive or Impala for Hadoop  Then sorting  Terasort  Unofficial standard  Now part of TPCx-HS  Hadoop samples  Wordcount, grep, terasort, DFSIO  YCSB  From Yahoo!  For NoSQL, HBase implementation  GridMix  CALDA  HiBench  SWIM  BigBench  based on TPC-DS + ML  30 queries  BigDataBench  33 workloads  TPCx-HS
  • 25. Comparison of popular Hadoop benchmarks
    Benchmark | Spec[1] | App domains | Workload types | Workloads | Scalable data sets[2] | Diverse implem[3] | Multi-tenancy[4] | Subset[5] | Simulator[6]
    BigDataBench | Y | Five | Four[7] | Thirty-three[8] | Eight[9] | Y | Y | Y | Y
    BigBench | Y | One | Three | Ten | Three | N | N | N | N
    CloudSuite | N | N/A | Two | Eight | Three | N | N | N | Y
    HiBench | N | N/A | Two | Ten | Three | N | N | N | N
    CALDA | Y | N/A | One | Five | N/A | Y | N | N | N
    YCSB | Y | N/A | One | Six | N/A | Y | N | N | N
    LinkBench | Y | N/A | One | Ten | N/A | Y | N | N | N
    AMP Benchmarks | Y | N/A | One | Four | N/A | Y | N | N | N
    The Differences of BigDataBench from Other Benchmarks Suites. Source: BigDataBench homepage
  • 26. What to measure and metrics  Job execution time  Throughput  Units / time  Framework overhead  # of spills  Scalability  Concurrency  Abstract metrics  CPU  MEM  DISK  IOPS, latency, bandwidth  NET  Latency, bandwidth  TPCx-HS performance metric (HSph@SF)
  • 28. Project ALOJA online repository  Entry point to explore the results collected from the executions,  Provides insights on the obtained results through continuously evolving data views.  Online results at: http://hadoop.bsc.es
  • 29. ALOJA Platform: Evolution and status  Benchmarking, Repository, and Analytics tools for Big Data  Composed of open-source  Benchmarking, provisioning and orchestration tools,  high-level system performance metric collection,  low-level Hadoop instrumentation based on BSC Tools  and Web based data analytics tools  and recommendations  Online Big Data Benchmark repository of:  20,000+ runs (from HiBench)  Sharable, comparable, repeatable, verifiable executions  Abstracting and leveraging tools for BD benchmarking  Not reinventing the wheel but,  most current BD tools designed for production, not for benchmarking  leverages current compatible tools and projects  Dev VM toolset and sandbox  via Vagrant Big Data Benchmarking Online Repository Analytics
  • 30. Workflow in ALOJA Cluster(s) definition • VM sizes • # nodes • OS, disks • Capabilities Execution plan • Start cluster • Exec Benchmarks • Gather results • Cleanup Import data • Convert perf metric • Parse logs • Import into DB Evaluate data • Data views in Vagrant VM • Or http://hadoop.bsc.es PA and KD •Predictive Analytics •Knowledge Discovery Historic Repo
  • 31. 34 Benchmarks Execution comparisons  You can compare, side by side, all execution parameters:  CPU, Memory, Network, Disk, Hadoop parameters…. Sample: http://hadoop.bsc.es/perfcharts?execs[]=91144
  • 32. HiBench suite  HiBench: A Benchmark Suite for Hadoop  HiBench: A Comprehensive & Realistic Benchmark Suite  Enhanced DFSIO  Micro Benchmarks  Web Search  Sort  WordCount  TeraSort  Nutch Indexing  Page Rank  Machine Learning  Bayesian Classification  K-Means Clustering  HDFS  Code at: https://github.com/intel-hadoop/HiBench
  • 33. Job resource requirements 1/2 Source: Intel HiBench
  • 34. Job resource requirements 2/2 Source: Intel HiBench
  • 35. Impact of SW configurations in Speedup Number of mappers Compression algorithm No comp. ZLIB BZIP2 snappy 4m 6m 8m 10m Speedup (higher is better) Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
  • 36. Impact of HW configurations in Speedup Disks and Network Cloud remote volumes Local only 1 Remote 2 Remotes 3 Remotes 3 Remotes /tmp local 2 Remotes /tmp local 1 Remote /tmp local HDD-ETH HDD-IB SSD-ETH SSD-IB Speedup (higher is better) Results using: http://hadoop.bsc.es/configimprovement Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
  • 37. Cost/Performance Scalability  Terasort (100GB) Sample from: http://hadoop.bsc.es/nodeseval Execution time Execution cost
  • 38. InfiniBand + SSD (LOCAL) GbE + SSD (LOCAL) CLOUD (local disk /tmp and HDFS) CLOUD (/tmp in local disk, HDFS in Blob storage, 1-3 devices) CLOUD (/tmp and HDFS in Blob storage, 1-3 devices) InfiniBand + SATA disks (LOCAL) GbE + SATA disks (LOCAL) Price Performance Cost-effectiveness (On-premise vs. Cloud) Details at: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
  • 39. Common Benchmarking pitfalls  Scalability  Assuming near-linear scalability  Compare apples to apples  if benchmarking HW change HW  but leave SW the same  Terasort in v1 != Terasort in v2  Test for Big Data use large data  stress the system  If results are too good to be true, they probably aren't  Don’t believe in miracles  Expect vendor lies Source: adapted from Benchmarking Big Data Systems by YANPEI CHEN and GWEN SHAPIRA at Big Data Spain
  • 40. Resources  ALOJA Benchmarking platform and online repository  http://hadoop.bsc.es/  Big Data Benchmarking Community (BDBC) mailing list  (~200 members from ~80 organizations)  http://clds.sdsc.edu/bdbc/community  Workshop Big Data Benchmarking (WBDB)  Next: http://clds.sdsc.edu/wbdb2015.ca  SPEC Research Big Data working group  http://research.spec.org/working-groups/big-data-working-group.html  Slides and video:  Michael Frank on Big Data benchmarking  http://www.tele-task.de/archive/podcast/20430/  Tilmann Rabl Big Data Benchmarking Tutorial  http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
  • 41. @BDOOP_BCN Benchmarking Hadoop by Nicolas Poggi @ni_po June 2, 2015
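
The slides on benchmarking issues (warm-up, averaging multiple runs), speedup charts, and execution cost can be tied together with a small sketch. This is only an illustration under stated assumptions, not ALOJA's implementation: the function names, the choice of discarding the first run as warm-up, and the flat per-node hourly price are all hypothetical, and the sample timings are made-up numbers.

```python
import statistics


def summarize_runs(times_s, warmup=1):
    """Discard the first `warmup` runs, then return (mean, stdev) in seconds.

    Assumption: early runs are treated as system warm-up and dropped.
    """
    measured = times_s[warmup:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs after warm-up")
    return statistics.mean(measured), statistics.stdev(measured)


def speedup(baseline_s, candidate_s):
    """Speedup of a candidate configuration over a baseline (higher is better)."""
    return baseline_s / candidate_s


def execution_cost(time_s, nodes, price_per_node_hour):
    """Cost of one execution, assuming a flat hourly price per node (hypothetical model)."""
    return time_s / 3600.0 * nodes * price_per_node_hour


if __name__ == "__main__":
    # Hypothetical Terasort timings in seconds; the first run is the warm-up.
    hdd_runs = [230.0, 200.0, 210.0, 190.0]
    ssd_runs = [130.0, 100.0, 110.0, 90.0]

    hdd_mean, hdd_sd = summarize_runs(hdd_runs)
    ssd_mean, ssd_sd = summarize_runs(ssd_runs)

    print(f"HDD: {hdd_mean:.1f}s +/- {hdd_sd:.1f}s")
    print(f"SSD: {ssd_mean:.1f}s +/- {ssd_sd:.1f}s")
    print(f"Speedup SSD vs HDD: {speedup(hdd_mean, ssd_mean):.2f}x")
    print(f"Cost per SSD run: {execution_cost(ssd_mean, nodes=8, price_per_node_hour=0.5):.3f}")
```

Reporting the standard deviation next to the mean makes it easier to tell whether a difference between two configurations is larger than the run-to-run noise.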