Jun Liu, Senior Software Engineer
Bianny Bian, Engineering Manager
SSG/STO/PAC
System Technologies & Optimization (STO) 2
Agenda
Quick Overview of Impala
Design Challenges of an Impala Deployment
Case Study: Use Simulation-Based Approach to Design and Optimize an Impala
Cluster
What's Inside: Intel CoFluent Technology for Big Data
Impala Overview
• Open-source MPP query execution engine
• Built natively for Hadoop
• Efficiently accesses data stored in Hadoop using SQL
• Pipelined execution mode enables fast data processing
Design Challenges of an Impala Cluster – H/W
• Meet Performance Requirements
• Plan for the Future
• No Over-Provisioning
Data sizes: 10 GB, 50 GB, 1 TB, 5 TB, 10 TB
Example: Cluster Sizing
Requirement: a deep analytic query over historical data should respond within 10 seconds
Example: Storage Choice of One Use Case
~0.0448%
• In general, SSD is faster than HDD, but there are exceptions
Example: CPU Frequency
• No impact on the illustrated workload running on the Text formatted table
• Scales well when running on the Parquet formatted table
Design Challenges of an Impala Cluster – S/W
Software Configuration Options:
• HDFS Cache
• HDFS Block Size
• Parquet Row Group Size
• ...
Example: HDFS Caching
Design Challenge Summary
We have talked about deployment challenges in terms of:
• hardware selections and settings
• software configuration choices
There is NO ONE-SIZE-FITS-ALL solution to the design challenges one faces when deploying a system for production.
Efficient Way to Predict System Performance?
Current Approach vs. Simulation Approach
Diagram labels: Deploy on Experimental Cluster; Generate Simulation Report; Collect and Analyze System Log; Simulation Plan; Change H/W config; Change H/W knobs; Adjust WL setting
Impala Simulator Overview
• Impala Query Execution Simulation
• Query Planning Flow
• Plan Nodes, Plan Fragments, Execution Nodes Generation
• Task Scheduling and Distribution
• Data Processing Flow (Pull & Push)
• Data Distribution (Data Skew and Partitioning)
• Disk IO Scheduling and Scan Operations
• Execution nodes
One Banking Use Case Study
• Offline Customer Account Historical Data Analysis
– Complex and Deep Analytic Queries
– Low Latency Interactive Queries
– Reporting Queries
• Initially evaluated on Hive, now Impala
Step 1: Deploy an Experimental Cluster
• Deploy a 4-node cluster
• Use a small-scale data set
Step 2: Collect Simulation Input
Hardware Configurations
– Node Count
– Processor
– Storage
– Network
– Memory
Software Configurations
– HDFS
– Impala
File Format
Table / Column Metadata
– COMPUTE STATS
– SHOW TABLE STATS
– DESC FORMATTED
– SHOW COLUMN STATS
Query Profile - PROFILE
Tuple Descriptors
– Impala Daemon Log
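The table and column metadata statements listed above can be generated per table with a small helper. This is only a sketch: the table name `accounts_history` is hypothetical, and the helper just builds the statement text for pasting into impala-shell. PROFILE is not included because it is an impala-shell command issued after a query has run rather than a per-table statement.

```python
def metadata_statements(table):
    """Return the metadata statements from the slide for one table, in order."""
    return [
        f"COMPUTE STATS {table}",        # gather table and column statistics
        f"SHOW TABLE STATS {table}",     # row counts, file counts, sizes, format
        f"DESCRIBE FORMATTED {table}",   # location, file format, table properties
        f"SHOW COLUMN STATS {table}",    # distinct values, nulls, sizes per column
    ]


# Hypothetical table name, used only for illustration.
for stmt in metadata_statements("accounts_history"):
    print(stmt + ";")
```

Running COMPUTE STATS first matters, since the two SHOW statements report meaningful statistics only after they have been computed.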
Example: Configure Table Metadata
Step 3: Baseline Validation on Experimental Cluster
Not just query execution time: we also compare against the Impala log file to check the duration of each stage
• disk-io-mgr.cc: disk id (1) reading for ....
• exchange.cc: #rows ... instance_id = ...
HashJoin Build Phase
HashJoin Probe Phase
Aggregation
Hdfs Scan Operation
Exchange Execution Node
Disk Worker 4
Disk Worker 0
Step 4: From Experimental Cluster to Production Cluster
• We have completed baseline verification on an experimental cluster
• Performance prediction for the production cluster
• Simulation assumptions:
• upper and lower data distribution boundaries
• small scale of the data
Step 5: Simulation Plan for Production Cluster
Software Configuration Matrix
• File Format: Text, Avro, Parquet, ...
• Compression: No Compression, GZIP, Snappy, ...
• Partition: Partitioned, No Partition
• Cache: Cached, No Cache
Hardware Configuration Matrix
• CPU Freq: 2.1 GHz, 2.4 GHz, 2.7 GHz, ...
• Network: 1GbE, 10GbE
• Cluster Size: 2, 4, 6, ...
• Disk Type: SSD, HDD
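One way to turn the two configuration matrices into a concrete list of simulation runs is a Cartesian product over the options. The sketch below uses only the values visible on the slide; the slide elides further options with "...", so these lists are partial, and the dictionary keys and the `enumerate_runs` helper are names invented here for illustration.

```python
from itertools import product

# Options as they appear on the slide; elided entries ("...") are not guessed.
software_matrix = {
    "file_format": ["Text", "Avro", "Parquet"],
    "compression": ["No Compression", "GZIP", "Snappy"],
    "partition": ["Partitioned", "No Partition"],
    "cache": ["Cached", "No Cache"],
}
hardware_matrix = {
    "cpu_freq_ghz": [2.1, 2.4, 2.7],
    "network": ["1GbE", "10GbE"],
    "cluster_size": [2, 4, 6],
    "disk_type": ["SSD", "HDD"],
}


def enumerate_runs(matrix):
    """Return one dict per combination of options in the matrix."""
    keys = list(matrix)
    return [dict(zip(keys, values)) for values in product(*matrix.values())]


software_runs = enumerate_runs(software_matrix)
hardware_runs = enumerate_runs(hardware_matrix)
print(len(software_runs), len(hardware_runs))
```

A real plan would likely prune combinations that are not meaningful for a given file format before handing the list to the simulator.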
Software Performance Prediction
> 40 GB data to cache
Cache Impact on Text Formatted Data
With Cache: HdfsScanNode finishes at around 6 sec
Without Cache: HdfsScanNode finishes at around 12 sec
• Block for a short period waiting for RowBatches
• Execution nodes are busy processing RowBatches
Cache Impact on Parquet Formatted Data
With Cache Without Cache
Fast Scan, CPU Bound
CPU bound: scan speed has no impact on the overall performance of query execution.
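The contrast between the Text and Parquet cache results can be pictured with a simplified bottleneck model: in a pipelined engine, scan and processing overlap, so the slower stage roughly sets the query time. This is an illustrative assumption, not the simulator's model. The 6 s and 12 s scan times come from the Text slide; every other number below is hypothetical.

```python
def pipelined_query_time(scan_seconds, cpu_seconds):
    """Stages overlap in a pipeline, so the slower stage sets the pace."""
    return max(scan_seconds, cpu_seconds)


# Text table: scan times from the slide, CPU cost hypothetical.
text_no_cache = pipelined_query_time(scan_seconds=12.0, cpu_seconds=7.0)
text_cached = pipelined_query_time(scan_seconds=6.0, cpu_seconds=7.0)

# Parquet table: all values hypothetical, chosen so that CPU dominates.
parquet_no_cache = pipelined_query_time(scan_seconds=3.0, cpu_seconds=9.0)
parquet_cached = pipelined_query_time(scan_seconds=1.5, cpu_seconds=9.0)

print(text_no_cache, text_cached)        # caching shortens the Text query
print(parquet_no_cache, parquet_cached)  # caching leaves the Parquet query unchanged
```

Under this model, caching helps only while the scan is the slower stage, which matches the observation that the CPU-bound Parquet query does not benefit.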
Software Configuration Recommendation
Baseline: Text, No Compression, No Partition, No Cache
Workloads: Reporting Workload; Deep Analytic Workload
Chart annotations: 10x Files to Scan; CPU Intensive
GZIP / Parquet / Partitioned / Cached: 14.45% / 2.74% / 0.49% / -9.22%
Avro / Snappy / Partitioned / Cached: 1.1% / 7.37% / 7.94% / 4.62%
Hardware Performance Prediction
2 Nodes: 1.00 (Baseline)
4 Nodes: 0.49
6 Nodes: 0.44
8 Nodes: 0.43
16 Nodes: 0.44
20 Nodes: 0.44
Network Transfer Cost: 470.7 ms
Network Transfer Cost: 494.2 ms
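The scaling figures can be read as execution time normalized to the 2-node baseline; that reading is an assumption, since the slide text does not name the quantity. Under it, a short script finds the smallest cluster whose time is within a chosen tolerance of the best value. The 5% tolerance and the function name are illustrative choices.

```python
nodes = [2, 4, 6, 8, 16, 20]
relative_time = [1.00, 0.49, 0.44, 0.43, 0.44, 0.44]  # values from the slide


def smallest_cluster_near_best(nodes, times, tolerance=0.05):
    """Return the smallest node count whose time is within tolerance of the best."""
    best = min(times)
    for n, t in zip(nodes, times):
        if t <= best * (1 + tolerance):
            return n


knee = smallest_cluster_near_best(nodes, relative_time)
speedups = [round(relative_time[0] / t, 2) for t in relative_time]
print(knee)
print(speedups)
```

Beyond the knee, adding nodes no longer shortens the query in these figures, which would be consistent with the network transfer cost on the slide becoming a larger share of the total.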
Expected Response Time
Overall Recommendation
Execution Time (Recommended): ~12.4 seconds
Execution Time (Baseline): ~63.3 seconds
1.8 GHz, 256 MB, No Compression, Text, 6 HDD, 10GbE, 4 Nodes
80%
Cluster Size: < 4x, 8 Nodes; < 10x, 16 Nodes; > 10x, 20 Nodes
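The arithmetic behind the recommendation can be checked directly. The sketch below assumes the 80% figure is the reduction in execution time and that the "x" thresholds refer to a growth factor relative to the current workload; neither is spelled out on the slide. The slide also leaves the case of exactly 10x undefined, so the helper arbitrarily places it in the 20-node bucket.

```python
BASELINE_SECONDS = 63.3      # Execution Time (Baseline) from the slide
RECOMMENDED_SECONDS = 12.4   # Execution Time (Recommended) from the slide

reduction = (BASELINE_SECONDS - RECOMMENDED_SECONDS) / BASELINE_SECONDS
speedup = BASELINE_SECONDS / RECOMMENDED_SECONDS


def recommended_nodes(growth_factor):
    """Map an assumed growth factor to the cluster size listed on the slide."""
    if growth_factor < 4:
        return 8
    if growth_factor < 10:
        return 16
    return 20  # covers "> 10x"; exactly 10x is placed here as an assumption


print(f"reduction: {reduction:.1%}, speedup: {speedup:.1f}x")
print(recommended_nodes(3), recommended_nodes(5), recommended_nodes(12))
```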
What’s Inside
Intel® CoFluent™ Technology for Big Data
SCALE UP WITH CONFIDENCE: Simulate to determine the minimum cost to meet your future demand
FASTER CLUSTER DEPLOYMENT: Explore deployment options and meet performance goals
OPTIMIZE CLUSTERS: Find performance bottlenecks and optimize software operation
Intel® CoFluent™ Studio Based Simulation
Enables fast “What if?” analysis with a virtual system
Layered Simulation Architecture
• System Topology, Role Assignment: build a cluster
• S/W Stack: HBase, Spark, M/R, HDFS, Impala, OS, JVM, …
• Dynamic S/W & H/W Mapping
• H/W Resource Monitoring and Performance Library: CPU, Memory, Storage, Ethernet, …
• Discrete Events Simulation Kernel on SystemC
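To give a feel for what an event-driven kernel does, here is a toy sketch in plain Python. It is not CoFluent Studio or SystemC code, and none of its names come from those tools. It models a set of disk workers scanning fixed-cost blocks, using a priority queue of "worker becomes free" events to advance simulated time.

```python
import heapq


def simulate_scan(num_blocks, num_disk_workers, seconds_per_block):
    """Return the simulated time at which the last block finishes scanning."""
    # Each entry is (time the worker becomes free, worker id).
    free_at = [(0.0, worker) for worker in range(num_disk_workers)]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_blocks):
        start, worker = heapq.heappop(free_at)   # earliest-free worker takes the block
        done = start + seconds_per_block
        finish = max(finish, done)
        heapq.heappush(free_at, (done, worker))  # schedule its next "free" event
    return finish


# Hypothetical workload: 10 blocks, 4 disk workers, 1.5 s per block.
print(simulate_scan(num_blocks=10, num_disk_workers=4, seconds_per_block=1.5))
```

Simulated time jumps from event to event instead of ticking in real time, which is one reason such a model can run faster than the system it describes.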
Software Stack Coverage
YARN
Hardware Coverage
Validated: 50 Nodes, SSD & HDD, 1GbE & 10GbE
Rack Scale Architecture: Pooled Compute, Pooled Memory, Pooled I/O
Simulation Accuracy
High simulation accuracy is achieved for Big Data applications running on different cluster sizes, hardware configurations and software stacks.
Fast Simulation
Simulation vs. Real Time in minutes
Number of concurrent uploading requests | 20 | 50 | 100 | 200
Hardware - 4 node Cluster (min) | 7 | 18 | 36 | 71
Simulation Speed - Lenovo T420 (min) | 2 | 6 | 14 | 29
Abstract Modeling
Event Driven Simulation
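The chart figures allow a quick comparison of simulation time against real run time. The sketch assumes the first data series (7, 18, 36, 71) belongs to the 4-node hardware cluster and the second (2, 6, 14, 29) to the simulation host, following the order of the legend.

```python
requests = [20, 50, 100, 200]
hardware_minutes = [7, 18, 36, 71]    # 4-node cluster, from the slide
simulation_minutes = [2, 6, 14, 29]   # Lenovo T420 host, from the slide

# How many times faster the simulation finishes than the real run.
ratios = [round(h / s, 2) for h, s in zip(hardware_minutes, simulation_minutes)]

for r, ratio in zip(requests, ratios):
    print(f"{r} concurrent requests: simulation is {ratio}x faster")
```

On these figures the advantage narrows as concurrency grows but stays above 2x at every measured load.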
Host machine to run simulations
Call to Action
Visit cofluent.intel.com for more information
Request white papers
Various customer success stories and use cases available
– Optimize a 50-node Hive/MR Cluster
– Predict the scalability of a large HBase Cluster
– Software parameter tuning for Spark applications
– …
Demo in the showcase – Intel booth
cofluent.intel.com
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or
from the OEM or retailer.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance.
Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark
results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future
costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact
your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number
of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on
Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current
characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether
referenced data are accurate.
Intel, CoFluent, Xeon, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.
Risk Factors
The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward-looking
statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "plans," "believes," "seeks," "estimates,"
"may," "will," "should" and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or
assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current expectations
regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the
following to be important factors that could cause actual results to differ materially from the company's expectations. Demand for Intel's products is highly
variable and could differ from expectations due to factors including changes in business and economic conditions; consumer confidence or income levels;
the introduction, availability and market acceptance of Intel's products, products used together with Intel products and competitors' products; competitive
and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order
patterns including order cancellations; and changes in the level of inventory at customers. Intel's gross margin percentage could vary significantly from
expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes
in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in
unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields. Variations in gross margin may also be
caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to
technological developments and to introduce new products or incorporate new features into existing products, which may result in restructuring and asset
impairment charges. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its
customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and
fluctuations in currency exchange rates. Results may also be affected by the formal or informal imposition by countries of new or revised export and/or
import and doing-business regulations, which could be changed without prior notice. Intel operates in highly competitive industries and its operations have
high costs that are either fixed or difficult to reduce in the short term. The amount, timing and execution of Intel's stock repurchase program could be
affected by changes in Intel's priorities for the use of cash, such as operational spending, capital spending, acquisitions, and as a result of changes to Intel's
cash flows or changes in tax laws. Product defects or errata (deviations from published specifications) may adversely impact our expenses, revenues and
reputation. Intel's results could be affected by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure
and other issues. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more
products, precluding particular business practices, impacting Intel's ability to design its products, or requiring other remedies such as compulsory licensing
of intellectual property. Intel's results may be affected by the timing of closing of acquisitions, divestitures and other significant transactions. A detailed
discussion of these and other factors that could affect Intel's results is included in Intel's SEC filings, including the company's most recent reports on Form
10-Q, Form 10-K and earnings release.
Rev. 4/14/15

More Related Content

What's hot

Netezza pure data
Netezza pure dataNetezza pure data
Netezza pure data
Hossein Sarshar
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuningSimon Huang
 
Deep Learning Accelerator Design Techniques
Deep Learning Accelerator Design TechniquesDeep Learning Accelerator Design Techniques
Deep Learning Accelerator Design Techniques
Mindos Cheng
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithms
Sabidur Rahman
 
Awr1page OTW2018
Awr1page OTW2018Awr1page OTW2018
Awr1page OTW2018
John Beresniewicz
 
Contract-oriented PLSQL Programming
Contract-oriented PLSQL ProgrammingContract-oriented PLSQL Programming
Contract-oriented PLSQL Programming
John Beresniewicz
 
The IBM Netezza Data Warehouse Appliance
The IBM Netezza Data Warehouse ApplianceThe IBM Netezza Data Warehouse Appliance
The IBM Netezza Data Warehouse Appliance
IBM Sverige
 
Oracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethodsOracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethods
Ajith Narayanan
 
Intel's Machine Learning Strategy
Intel's Machine Learning StrategyIntel's Machine Learning Strategy
Intel's Machine Learning Strategy
inside-BigData.com
 
Oracle Exadata X2-8: A Critical Review
Oracle Exadata X2-8: A Critical ReviewOracle Exadata X2-8: A Critical Review
Oracle Exadata X2-8: A Critical Review
Texas Memory Systems, and IBM Company
 
Sun Oracle Exadata Technical Overview V1
Sun Oracle Exadata Technical Overview V1Sun Oracle Exadata Technical Overview V1
Sun Oracle Exadata Technical Overview V1jenkin
 
Why is my_oracle_e-biz_database_slow_a_million_dollar_question
Why is my_oracle_e-biz_database_slow_a_million_dollar_questionWhy is my_oracle_e-biz_database_slow_a_million_dollar_question
Why is my_oracle_e-biz_database_slow_a_million_dollar_question
Ajith Narayanan
 
Whitepaper: Exadata Consolidation Success Story
Whitepaper: Exadata Consolidation Success StoryWhitepaper: Exadata Consolidation Success Story
Whitepaper: Exadata Consolidation Success StoryKristofferson A
 
Oracle ebs capacity_analysisusingstatisticalmethods
Oracle ebs capacity_analysisusingstatisticalmethodsOracle ebs capacity_analysisusingstatisticalmethods
Oracle ebs capacity_analysisusingstatisticalmethods
Ajith Narayanan
 
Awr + 12c performance tuning
Awr + 12c performance tuningAwr + 12c performance tuning
Awr + 12c performance tuning
AiougVizagChapter
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query Optimization
Anju Garg
 
An Introduction to Netezza
An Introduction to NetezzaAn Introduction to Netezza
An Introduction to Netezza
Vijaya Chandrika
 
Oracle Trace File Analyzer - What's New in 12.2.1.1.0
Oracle Trace File Analyzer - What's New in 12.2.1.1.0Oracle Trace File Analyzer - What's New in 12.2.1.1.0
Oracle Trace File Analyzer - What's New in 12.2.1.1.0
Gareth Chapman
 
Oracle real application_cluster
Oracle real application_clusterOracle real application_cluster
Oracle real application_cluster
Prabhat gangwar
 

What's hot (20)

Netezza pure data
Netezza pure dataNetezza pure data
Netezza pure data
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuning
 
Deep Learning Accelerator Design Techniques
Deep Learning Accelerator Design TechniquesDeep Learning Accelerator Design Techniques
Deep Learning Accelerator Design Techniques
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithms
 
Awr1page OTW2018
Awr1page OTW2018Awr1page OTW2018
Awr1page OTW2018
 
Contract-oriented PLSQL Programming
Contract-oriented PLSQL ProgrammingContract-oriented PLSQL Programming
Contract-oriented PLSQL Programming
 
The IBM Netezza Data Warehouse Appliance
The IBM Netezza Data Warehouse ApplianceThe IBM Netezza Data Warehouse Appliance
The IBM Netezza Data Warehouse Appliance
 
Oracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethodsOracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethods
 
Intel's Machine Learning Strategy
Intel's Machine Learning StrategyIntel's Machine Learning Strategy
Intel's Machine Learning Strategy
 
Oracle Exadata X2-8: A Critical Review
Oracle Exadata X2-8: A Critical ReviewOracle Exadata X2-8: A Critical Review
Oracle Exadata X2-8: A Critical Review
 
IBM Netezza
IBM NetezzaIBM Netezza
IBM Netezza
 
Sun Oracle Exadata Technical Overview V1
Sun Oracle Exadata Technical Overview V1Sun Oracle Exadata Technical Overview V1
Sun Oracle Exadata Technical Overview V1
 
Why is my_oracle_e-biz_database_slow_a_million_dollar_question
Why is my_oracle_e-biz_database_slow_a_million_dollar_questionWhy is my_oracle_e-biz_database_slow_a_million_dollar_question
Why is my_oracle_e-biz_database_slow_a_million_dollar_question
 
Whitepaper: Exadata Consolidation Success Story
Whitepaper: Exadata Consolidation Success StoryWhitepaper: Exadata Consolidation Success Story
Whitepaper: Exadata Consolidation Success Story
 
Oracle ebs capacity_analysisusingstatisticalmethods
Oracle ebs capacity_analysisusingstatisticalmethodsOracle ebs capacity_analysisusingstatisticalmethods
Oracle ebs capacity_analysisusingstatisticalmethods
 
Awr + 12c performance tuning
Awr + 12c performance tuningAwr + 12c performance tuning
Awr + 12c performance tuning
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query Optimization
 
An Introduction to Netezza
An Introduction to NetezzaAn Introduction to Netezza
An Introduction to Netezza
 
Oracle Trace File Analyzer - What's New in 12.2.1.1.0
Oracle Trace File Analyzer - What's New in 12.2.1.1.0Oracle Trace File Analyzer - What's New in 12.2.1.1.0
Oracle Trace File Analyzer - What's New in 12.2.1.1.0
 
Oracle real application_cluster
Oracle real application_clusterOracle real application_cluster
Oracle real application_cluster
 

Similar to Strata + Hadoop 2015 Slides

Systems oracle overview_hardware
Systems oracle overview_hardwareSystems oracle overview_hardware
Systems oracle overview_hardware
Fran Navarro
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Databricks
 
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red_Hat_Storage
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Databricks
 
Resume_CQ_Edward
Resume_CQ_EdwardResume_CQ_Edward
Resume_CQ_Edwardcaiqi wang
 
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI ConvergenceDAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
inside-BigData.com
 
Exadata
ExadataExadata
Exadata
vkv_vkv
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
Facultad de Informática UCM
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red_Hat_Storage
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Community
 
Using SAS GRID v 9 with Isilon F810
Using SAS GRID v 9 with Isilon F810Using SAS GRID v 9 with Isilon F810
Using SAS GRID v 9 with Isilon F810
Boni Bruno
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Ceph Community
 
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chip
Spring Hill (NNP-I 1000): Intel's Data Center Inference ChipSpring Hill (NNP-I 1000): Intel's Data Center Inference Chip
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chip
inside-BigData.com
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
javier ramirez
 
Výhody a benefity nasazení Oracle Database Appliance
Výhody a benefity nasazení Oracle Database ApplianceVýhody a benefity nasazení Oracle Database Appliance
Výhody a benefity nasazení Oracle Database Appliance
MarketingArrowECS_CZ
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red_Hat_Storage
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Databricks
 

Similar to Strata + Hadoop 2015 Slides (20)

Systems oracle overview_hardware
Systems oracle overview_hardwareSystems oracle overview_hardware
Systems oracle overview_hardware
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
 
Resume_CQ_Edward
Resume_CQ_EdwardResume_CQ_Edward
Resume_CQ_Edward
 
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI ConvergenceDAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
 
Exadata
ExadataExadata
Exadata
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK
 
optimizing_ceph_flash
optimizing_ceph_flashoptimizing_ceph_flash
optimizing_ceph_flash
 
No[1][1]
No[1][1]No[1][1]
No[1][1]
 
Using SAS GRID v 9 with Isilon F810
Using SAS GRID v 9 with Isilon F810Using SAS GRID v 9 with Isilon F810
Using SAS GRID v 9 with Isilon F810
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chip
Spring Hill (NNP-I 1000): Intel's Data Center Inference ChipSpring Hill (NNP-I 1000): Intel's Data Center Inference Chip
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chip
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Výhody a benefity nasazení Oracle Database Appliance
Výhody a benefity nasazení Oracle Database ApplianceVýhody a benefity nasazení Oracle Database Appliance
Výhody a benefity nasazení Oracle Database Appliance
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...

Strata + Hadoop 2015 Slides

  • 1. Jun Liu, Senior Software Engineer; Bianny Bian, Engineering Manager; SSG/STO/PAC
  • 2. System Technologies & Optimization (STO). Agenda: Quick Overview of Impala; Design Challenges of an Impala Deployment; Case Study: Use a Simulation-Based Approach to Design and Optimize an Impala Cluster; What’s Inside: Intel CoFluent Technology for Big Data
  • 3. Impala Overview: Open-source MPP query execution engine; Built natively for Hadoop; Efficiently accesses data stored in Hadoop using SQL; Pipelined execution mode enables fast data processing
  • 4. Design Challenges of an Impala Cluster – H/W: Meet Performance Requirements; Plan for the Future; Not Over-Provisioning (figure labels: 10 GB, 50 GB, 1 TB, 5 TB, 10 TB)
  • 5. Example: Cluster Sizing. Requirement: a deep data analytic query over historical data should respond within 10 seconds
  • 6. Example: Storage Choice of One Use Case (figure label: ~0.0448%). In general, SSD is faster than HDD, but there are exceptions
  • 7. Example: CPU Frequency • No impact on the illustrated workload running on the Text-formatted table • Scales well when running on the Parquet-formatted table
  • 8. Design Challenges of an Impala Cluster – S/W. Software Configuration Options: HDFS Cache; HDFS Block Size; Parquet Row Group Size; ....
  • 9. Example: HDFS Caching
  • 10. Design Challenge Summary. We have talked about deployment challenges, in terms of: • hardware selections and settings • software configuration choices. There’s NO ONE-SIZE-FITS-ALL solution to the design challenges one faces when deploying a system for production. Efficient Way to Predict System Performance? Current Approach
  • 11. Simulation Approach: Deploy on Experimental Cluster; Generate Simulation Report; Collect and Analyze System Log; Simulation Plan; Change H/W config; Change H/W knobs; Adjust WL setting
  • 12. Impala Simulator Overview • Impala Query Execution Simulation • Query Planning Flow • Plan Nodes, Plan Fragments, Execution Nodes Generation • Task Scheduling and Distribution • Data Processing Flow (Pull & Push) • Data Distribution (Data Skew and Partitioning) • Disk IO Scheduling and Scan Operations • Execution nodes
  • 13. One Banking Use Case Study • Offline Customer Account Historical Data Analysis – Complex and Deep Analytic Queries – Low Latency Interactive Queries – Reporting Queries • Initially evaluated on Hive, now Impala
  • 14. Step 1: Deploy an Experimental Cluster • Deploy a 4-node cluster • Small scale of the data
  • 15. Step 2: Collect Simulation Input. Hardware Configurations – Node Count – Processor – Storage – Network – Memory. Software Configurations – HDFS – Impala. File Format. Table / Column Metadata – COMPUTE STATS – SHOW TABLE STATS – DESC FORMATTED – SHOW COLUMN STATS. Query Profile – PROFILE. Tuple Descriptors – Impala Daemon Log
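The table and column metadata statements listed for Step 2 could be generated per table with a small helper such as the one below. This is only a sketch: the helper name `metadata_statements` and the table name `accounts_history` are hypothetical, and `DESC FORMATTED` from the slide is written out as `DESCRIBE FORMATTED`. The query profile (`PROFILE`) is not included because it is requested from the shell after a query has run rather than issued per table.

```python
def metadata_statements(table):
    """Return the metadata statements named on the slide for one table."""
    return [
        f"COMPUTE STATS {table}",        # gather table and column statistics
        f"SHOW TABLE STATS {table}",     # row counts, file counts, sizes, format
        f"DESCRIBE FORMATTED {table}",   # location, file format, table properties
        f"SHOW COLUMN STATS {table}",    # per-column statistics
    ]


# "accounts_history" is a hypothetical table name used for illustration.
for stmt in metadata_statements("accounts_history"):
    print(stmt + ";")
```

The printed statements can then be pasted into, or piped to, whatever SQL client is used against the experimental cluster.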
  • 16. Example: Configure Table Metadata
  • 17. Step 3: Baseline Validation on Experimental Cluster
  • 18. Not just query execution time. We also compare with the Impala log file to check the duration of each stage • disk-io-mgr.cc: disk id (1) reading for .... • exchange.cc: #rows ... instance_id = ... (figure labels: HashJoin Build Phase, HashJoin Probe Phase, Aggregation, Hdfs Scan Operation, Exchange, Execution Node, Disk Worker 4, Disk Worker 0)
  • 19. Step 4: From Experimental Cluster to Production Cluster • We have completed baseline verification on an experimental cluster • Performance prediction for the production cluster • Simulation assumptions: • upper and lower data distribution boundaries • small scale of the data
  • 20. Step 5: Simulation Plan for Production Cluster. Software Configuration Matrix: File Format (Text, Avro, Parquet, ...); Compression (GZIP, Snappy, No Compression, ...); Partition (Partitioned, No Partition); Cache (Cached, No Cache). Hardware Configuration Matrix: CPU Freq (2.1 GHz, 2.4 GHz, 2.7 GHz, ...); Network (1GbE, 10GbE); Disk Type (SSD, HDD); Cluster Size (2, 4, 6, ...)
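A simulation plan of this kind amounts to enumerating combinations of option values. The sketch below does this for the software matrix only, using just the values that are legible on the slide; the entries shown as "..." are left out rather than guessed, and the names `SOFTWARE_OPTIONS` and `build_plan` are illustrative, not part of any tool mentioned in the deck.

```python
from itertools import product

# Option values named on the slide; the "..." entries are intentionally omitted.
SOFTWARE_OPTIONS = {
    "file_format": ["Text", "Avro", "Parquet"],
    "compression": ["No Compression", "GZIP", "Snappy"],
    "partition": ["No Partition", "Partitioned"],
    "cache": ["No Cache", "Cached"],
}


def build_plan(options):
    """Return every combination of option values as a list of dicts."""
    keys = list(options)
    return [dict(zip(keys, combo)) for combo in product(*options.values())]


software_plan = build_plan(SOFTWARE_OPTIONS)
print(len(software_plan))   # 3 * 3 * 2 * 2 combinations
print(software_plan[0])     # the all-defaults corner of the matrix
```

Even this reduced matrix yields 36 software configurations before any hardware option is varied, which is the practical argument for exploring the space in simulation rather than on real clusters.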
  • 21. Software Performance Prediction
  • 22. Software Performance Prediction: > 40 GB data to cache
  • 23. Cache Impact on Text Formatted Data. With Cache: HdfsScanNode finishes at around 6 sec. Without Cache: HdfsScanNode finishes at around 12 sec
  • 24. Cache Impact on Text Formatted Data: Block for a short period waiting for RowBatches; Execution nodes are busy processing RowBatches
  • 25. Cache Impact on Parquet Formatted Data (With Cache, Without Cache): Fast Scan, CPU Bound
  • 26. Cache Impact on Parquet Formatted Data: CPU bound; scan speed does not have an impact on the overall performance of query execution.
  • 27. Software Configuration Recommendation. Baseline: Text, No Compression, No Partition, No Cache. Parquet, GZIP, Partitioned, Cached: 14.45%, 2.74%, 0.49%, -9.22%. Avro, Snappy, Partitioned, Cached: 1.1%, 7.37%, 7.94%, 4.62%. Figure labels: Reporting Workload, Deep Analytic Workload, 10x Files to Scan, CPU Intensive
  • 28. Hardware Performance Prediction
  • 29. Hardware Performance Prediction. 2 Nodes: 1.00; 4 Nodes: 0.49; 6 Nodes: 0.44; 8 Nodes: 0.43; 16 Nodes: 0.44; 20 Nodes: 0.44. Baseline. Network Transfer Cost: 470.7 ms; Network Transfer Cost: 494.2 ms
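To read the scaling figures on this slide, it helps to convert them into speedups and step-by-step gains. The sketch below assumes that the six values pair with the six node-count labels in the order they appear and that they are execution times relative to the 2-node run; both are assumptions, since the chart itself is not reproduced in the transcript.

```python
# Assumed pairing of the slide's values with its node-count labels, in order.
relative_time = {2: 1.00, 4: 0.49, 6: 0.44, 8: 0.43, 16: 0.44, 20: 0.44}


def speedups(times):
    """Speedup of each cluster size relative to the smallest one."""
    base = times[min(times)]
    return {nodes: base / t for nodes, t in times.items()}


def marginal_gains(times):
    """Fractional drop in time versus the next-smaller cluster size."""
    sizes = sorted(times)
    return {b: (times[a] - times[b]) / times[a] for a, b in zip(sizes, sizes[1:])}


for nodes, gain in marginal_gains(relative_time).items():
    print(f"{nodes:>2} nodes: speedup {speedups(relative_time)[nodes]:.2f}x, "
          f"gain over previous size {gain:+.1%}")
```

Under that pairing, doubling from 2 to 4 nodes roughly halves the time, while every later step changes it by only a few percent in either direction.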
  • 30. Hardware Performance Prediction: Expected Response Time
  • 31. Overall Recommendation. Execution Time (Recommended): ~12.4 seconds. Execution Time (Baseline): ~63.3 seconds. Figure labels: 1.8 GHz, 256 MB, No Compression, Text, 6 HDD, 10GbE, 4 Nodes, 80%. Cluster Size: < 4x, 8 Nodes; < 10x, 16 Nodes; > 10x, 20 Nodes
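The two execution times quoted on this slide can be turned into a speedup and a percentage reduction with a short calculation. Only the two approximate times come from the slide; the variable names are illustrative.

```python
baseline_s = 63.3      # Execution Time (Baseline), approximate, from the slide
recommended_s = 12.4   # Execution Time (Recommended), approximate, from the slide

speedup = baseline_s / recommended_s
reduction = 1 - recommended_s / baseline_s

print(f"speedup: {speedup:.1f}x")
print(f"execution time reduced by {reduction:.0%}")
```

With these inputs the recommended configuration is about 5.1 times faster than the baseline, which corresponds to cutting execution time by roughly 80%.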
  • 32. What’s Inside
  • 33. Intel® CoFluent™ Technology for Big Data. SCALE UP WITH CONFIDENCE: Simulate to determine the minimum cost to meet your future demand. FASTER CLUSTER DEPLOYMENT: Explore deployment options and meet performance goals. OPTIMIZE CLUSTERS: Find performance bottlenecks and optimize software operation
  • 34. Intel® CoFluent™ Studio Based Simulation: Enables fast “What if?” analysis with a virtual system
  • 35. Layered Simulation Architecture. S/W Stack: HBase, Spark, M/R, HDFS, Impala, OS, JVM, …. Dynamic S/W & H/W Mapping. H/W Resource Monitoring and Performance Library: CPU, Memory, Storage, Ethernet, …. Discrete Events Simulation Kernel on SystemC. Build a cluster: System Topology, Role Assignment
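The bottom layer of this architecture is a discrete-event simulation kernel. The toy below illustrates the general idea only: a priority queue of timestamped events, with simulated time jumping from one event to the next. It is not the CoFluent or SystemC kernel, and `EventKernel`, `scan_block` and the read times are made up for illustration.

```python
import heapq
from itertools import count


class EventKernel:
    """Toy discrete-event kernel: pops events in timestamp order."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = count()  # tie-breaker so equal timestamps stay ordered

    def schedule(self, delay, action):
        """Register action(kernel) to fire `delay` time units from now."""
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), action))

    def run(self):
        """Advance simulated time event by event until the queue is empty."""
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action(self)


log = []


def scan_block(kernel, block_id, read_time):
    """Model one disk read as a single completion event (hypothetical model)."""
    def done(k):
        log.append((k.now, f"block {block_id} read"))
    kernel.schedule(read_time, done)


kernel = EventKernel()
for block_id, read_time in enumerate([0.3, 0.1, 0.2]):  # made-up read times
    scan_block(kernel, block_id, read_time)
kernel.run()
print(log)
```

Because time advances only when an event fires, the cost of a run depends on the number of events rather than on the length of the simulated interval, which is one reason event-driven models can run faster than the system they describe.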
  • 36. Software Stack Coverage: YARN
  • 37. Hardware Coverage. Validated: 50 Nodes; SSD & HDD; 1GbE & 10GbE. Rack Scale Architecture: Pooled Compute, Pooled Memory, Pooled I/O
  • 38. Simulation Accuracy: High simulation accuracy is achieved for Big Data applications running on different cluster sizes, hardware configurations and software stacks.
  • 39. Fast Simulation: Abstract Modeling, Event Driven Simulation. Simulation vs. Real Time in minutes, by number of concurrent uploading requests (20, 50, 100, 200). Series: Hardware - 4 node Cluster (min); Simulation Speed - Lenovo T420 (min). Values: 7, 18, 36, 71 and 2, 6, 14, 29
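The chart values on this slide can be compared directly once each series is matched to its legend entry. The sketch below assumes that 7, 18, 36, 71 are the minutes measured on the 4-node hardware cluster and 2, 6, 14, 29 are the minutes the simulation took on the laptop; the transcript lists the numbers and the legend but does not state which is which, so this pairing is an assumption.

```python
concurrent_requests = [20, 50, 100, 200]
hardware_min = [7, 18, 36, 71]     # assumed: real 4-node cluster run time
simulation_min = [2, 6, 14, 29]    # assumed: simulator run time on the laptop

# Ratio of real run time to simulation run time at each load level.
ratios = [h / s for h, s in zip(hardware_min, simulation_min)]

for n, r in zip(concurrent_requests, ratios):
    print(f"{n:>3} requests: simulation {r:.2f}x faster than the real run")
```

Under that assumption the simulation finishes about 2.4 to 3.5 times sooner than the real run across the four load levels.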
  • 40. Host machine to run simulations
  • 41. Call to Action. Visit cofluent.intel.com for more information. Request white papers. Various customer success stories and use cases available – Optimize a 50-node Hive/MR Cluster – Predict the scalability of a large HBase Cluster – Software parameter tuning for Spark applications – … Demo in the showcase – Intel booth
  • 42. cofluent.intel.com
  • 43. Legal Notices and Disclaimers: Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel, CoFluent, Xeon, and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. © 2015 Intel Corporation.
  • 44. Risk Factors: The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "plans," "believes," "seeks," "estimates," "may," "will," "should" and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company's expectations. Demand for Intel's products is highly variable and could differ from expectations due to factors including changes in business and economic conditions; consumer confidence or income levels; the introduction, availability and market acceptance of Intel's products, products used together with Intel products and competitors' products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers.
Intel's gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields. Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to technological developments and to introduce new products or incorporate new features into existing products, which may result in restructuring and asset impairment charges. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Results may also be affected by the formal or informal imposition by countries of new or revised export and/or import and doing-business regulations, which could be changed without prior notice. Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term. The amount, timing and execution of Intel's stock repurchase program could be affected by changes in Intel's priorities for the use of cash, such as operational spending, capital spending, acquisitions, and as a result of changes to Intel's cash flows or changes in tax laws. Product defects or errata (deviations from published specifications) may adversely impact our expenses, revenues and reputation. 
Intel's results could be affected by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel's ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. Intel's results may be affected by the timing of closing of acquisitions, divestitures and other significant transactions. A detailed discussion of these and other factors that could affect Intel's results is included in Intel's SEC filings, including the company's most recent reports on Form 10-Q, Form 10-K and earnings release. Rev. 4/14/15