Demo - Smart City Use-case
Using ODPi Hadoop, Spark, H2O and Sparkling Water
Ganesh Raju
ENGINEERS AND DEVICES
WORKING TOGETHER
ODPi
● Simplifies and standardizes the big data ecosystem with a common reference
specification and test suites.
● Reduces cost and complexity and accelerates the development of Big Data
solutions.
● Provides cross-compatibility between different distributions of Hadoop and big data
technologies.
● Has two stacks: Runtime and Operations
● V2.0 alpha release coming soon
● Linaro is a member of ODPi
www.odpi.org
Spark
● Distributed and fast in-memory data processing engine
● Provides development APIs to efficiently execute iterative streaming, machine
learning or SQL workloads
● Spark was developed as an alternative to MapReduce, with ease of
use in mind.
● Code in Java, Scala, or Python.
H2O Sparkling Water
● H2O is an in-memory, user-friendly machine learning API
● Compatible with Hadoop and Spark
● Spark + H2O is Sparkling Water
● Sparkling Water combines the fast, scalable machine learning algorithms
of H2O with the high-performance distributed processing capabilities of the Spark
engine.
● Spark’s RDD and DataFrame and H2O’s H2OFrame are interoperable
● Users can use the H2O Flow UI to drive Scala / R / Python computation from
Spark
Demo
● Uses ODPi v1 based native Hadoop, Spark, H2O Sparkling Water and H2O Flow.
● All compiled on ARM: ODPi Hadoop 2.7, Spark 1.6 with Scala 2.10 (Scala 2.11 is
not supported with Sparkling Water)
● 3-node cluster running on Linaro Developer Cloud (HP Moonshot machines)
● Dataset files are stored in HDFS.
● Spark uses YARN as the resource manager.
● H2O Sparkling Water uses Spark as the execution engine.
● H2O Flow uses the Spark SQL API and Scala code
● .csv data -> HDFS -> Spark RDD -> H2O H2OFrame
https://wiki.linaro.org/LEG/Engineering/BigData
Benchmarking Big Data
Ganesh Raju and Naresh Bhat
Abstract
● Various benchmarking tools
● Types of benchmarks and standards
● Challenges of Big Data benchmarking on ARM
● Tools covered include the TPC (Transaction Processing
Performance Council) benchmarks TPCx-HS, TPC-DS and TPC-H; HiBench
(TestDFSIO); Spark-Bench for Apache Spark; MRBench for MapReduce;
and NNBench for HDFS.
Why Benchmarking?
● Measure performance and scale
● Simulate higher load
○ Find bottlenecks/limits
● Evaluate different hardware/software
○ OS, Java, VM
○ Hadoop, Spark, Pig, Hive, ...
● Validate reliability
● Validate assumptions / configurations
● Compare two different deployments
● Performance tuning
Challenges of BigData benchmarking
● System Diversity
○ Variety of Solutions - Data Read, I/O, Streaming, Data warehousing,
Machine Learning
● Rapid Data Evolution - Velocity.
● System and Data Scale
● System Complexity
○ Multiple pipelines (layers of Transformations)
Types of benchmarks and standards
● Micro benchmarks: To evaluate specific lower-level, system operations
○ E.g. Hadoop Workload Examples (sort, grep, wordcount and Terasort,
Gridmix, Pigmix), HiBench, HDFS DFSIO, AMP Lab Big Data Benchmark
● Functional/Component benchmarks: Specific to a low-level function
○ E.g. basic SQL queries (select, join, etc.)
○ Synthetic benchmarks
● Application level
○ Bigbench
○ Spark bench
Benchmark Efforts - Microbenchmarks
| Benchmark | Workloads | Software Stacks | Metrics |
|---|---|---|---|
| HiBench | Sort, WordCount, TeraSort, PageRank, K-means, Bayes classification, Index | Hadoop and Hive | Execution time, throughput, resource utilization |
| DFSIO | Generate, read, write, append, and remove data for MapReduce jobs | Hadoop | Execution time, throughput |
| AMPLab benchmark | Part of CALDA workloads (scan, aggregate and join) and PageRank | Hive, Tez | Execution time |
Benchmark Efforts - TPC
| Benchmark | Workloads | Software Stacks | Metrics |
|---|---|---|---|
| TPCx-HS | HSGen, HSDataCheck, HSSort and HSValidate | Hadoop | Performance, price and energy |
| TPC-H | Data warehousing operations | Hive, Pig | Execution time, throughput |
| TPC-DS | Decision support benchmark: data loading, queries and maintenance | Hive, Pig | Execution time, throughput |
Benchmark Efforts - Synthetic
| Benchmark | Workloads | Software Stacks | Metrics |
|---|---|---|---|
| SWIM | Synthetic user-generated MapReduce jobs of reading, writing, shuffling and sorting | Hadoop | Multiple metrics |
| GridMix | Synthetic and basic operations to stress test the job scheduler and compression and decompression | Hadoop | Memory, execution time, throughput |
| PigMix | 17 Pig-specific queries | Hadoop, Pig | Execution time |
| MRBench | MapReduce benchmark complementary to TeraSort: data warehouse operations with 22 TPC-H queries | Hadoop | Execution time |
| NNBench and NNBenchWithoutMR | Load testing the NameNode and HDFS I/O with small payloads | Hadoop | I/O |
| SparkBench | CPU, memory, shuffle and I/O intensive workloads. Machine learning, streaming, graph computation and SQL workloads | Spark | Execution time, data process rate |
| BigBench | Interactive-based queries based on synthetic data | Hadoop, Spark | Execution time |
Benchmark Efforts
| Benchmark | Workloads | Software Stacks | Metrics |
|---|---|---|---|
| BigDataBench | 1. Micro benchmarks (sort, grep, WordCount); 2. Search engine workloads (index, PageRank); 3. Social network workloads (connected components (CC), K-means and BFS); 4. E-commerce site workloads (relational database queries (select, aggregate and join), collaborative filtering (CF) and Naive Bayes); 5. Multimedia analytics workloads (speech recognition, ray tracing, image segmentation, face detection); 6. Bioinformatics workloads | Hadoop, DBMSs, NoSQL systems, Hive, Impala, HBase, MPI, Libc, and other real-time analytics systems | Throughput, memory, CPU (MIPS, MPKI - misses per kilo-instruction) |
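MPKI, listed among the BigDataBench CPU metrics, is conventionally read as misses per kilo-instruction: cache, TLB or branch misses per 1,000 retired instructions. A minimal Python sketch of the arithmetic follows, assuming the two counters have already been read from a profiler. The function name and the sample counter values are made up for illustration.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo-instruction: misses per 1,000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return misses * 1000.0 / instructions


# Hypothetical counter values, for illustration only.
sample = mpki(misses=2_500_000, instructions=1_000_000_000)
print(f"{sample:.2f} MPKI")
```

Normalizing by instruction count makes the figure comparable between runs that execute different amounts of work.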
Hadoop benchmark and test tools
● The Hadoop distribution comes with a number of benchmarks
● TestDFSIO, nnbench, mrbench are in hadoop-*test*.jar
● TeraGen, TeraSort, TeraValidate are in hadoop-*examples*.jar
● You can list the available programs by running each jar without arguments:
$ cd /usr/local/hadoop
$ bin/hadoop jar hadoop-*test*.jar
$ bin/hadoop jar hadoop-*examples*.jar
● While running the benchmarks you might want to use the time command, which
measures the elapsed time. This saves you the hassle of navigating to the
Hadoop JobTracker interface. The relevant metric is the real value in the first row.
$ time hadoop jar hadoop-*examples*.jar ...
[...]
real 9m15.510s
user 0m7.075s
sys 0m0.584s
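When several runs are compared, it can help to turn the "real" line into plain seconds. The Python sketch below is one way to do that, assuming the output has the shape shown above (minutes, then seconds with an "s" suffix). The helper name real_seconds is hypothetical, and other shells or time implementations may print a different layout.

```python
import re


def real_seconds(time_output: str) -> float:
    """Return the 'real' (wall-clock) time in seconds from `time` output."""
    match = re.search(r"^real\s+(?:(\d+)m)?([\d.]+)s\s*$", time_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'real' line found")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# The sample output shown on the slide.
SAMPLE = """real 9m15.510s
user 0m7.075s
sys 0m0.584s"""

print(real_seconds(SAMPLE))
```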
TeraGen, TeraSort and TeraValidate
● This is the most well-known Hadoop benchmark
● The goal of TeraSort is to sort the data as fast as possible
● This test suite exercises both the HDFS and MapReduce layers of a Hadoop cluster
● The TeraSort benchmark consists of 3 steps
○ Generate input via TeraGen
○ Run TeraSort on input data
○ Validate sorted output data via TeraValidate
https://wiki.linaro.org/LEG/Engineering/BigData/HadoopBuildInstallAndRunGuide
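The three steps can be sketched as a small Python helper that sizes the input and prints the matching command lines. It relies on TeraGen taking a row count and writing 100-byte rows. The HDFS directory names and the helper name terasort_plan are hypothetical, and the jar pattern is the one used elsewhere in these slides, so adjust both to your installation.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def terasort_plan(total_bytes: int, base_dir: str = "/benchmarks/terasort"):
    """Return the TeraGen, TeraSort and TeraValidate command lines for a data size."""
    rows = total_bytes // ROW_BYTES
    jar = "hadoop-*examples*.jar"  # jar pattern used on the slides
    # Hypothetical HDFS directories for input, sorted output and validation report.
    gen, out, report = (f"{base_dir}-{s}" for s in ("input", "output", "validate"))
    return [
        f"hadoop jar {jar} teragen {rows} {gen}",
        f"hadoop jar {jar} terasort {gen} {out}",
        f"hadoop jar {jar} teravalidate {out} {report}",
    ]


for command in terasort_plan(10 * 10**9):  # roughly 10 GB of input
    print("$ time " + command)
```

Each step reads the directory the previous step wrote, so the three commands have to run in this order.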
HiBench
● Contains 9 typical Hadoop and Spark workloads (including micro benchmarks, HDFS benchmarks,
web search benchmarks, machine learning benchmarks using Mahout, and data analytics
benchmarks)
● Sort, WordCount, TeraSort, TestDFSIO, Nutch indexing (search indexing using Nutch engine),
PageRank (An implementation of Google’s Web page ranking algorithm), hivebench
● Uses zlib compression for input and output
● Metrics: Time (sec) & Throughput (Bytes/Sec), memory partitions, parallelism
● Cons: Lack of AArch64 bits, lack of documentation
https://wiki.linaro.org/LEG/Engineering/BigData/HiBench
TestDFSIO
● It is part of hadoop-mapreduce-client-jobclient.jar
● Stress tests I/O performance (throughput and latency) on a clustered setup.
● This test will shake out the hardware, OS and Hadoop setup on your cluster
machines (NameNode/DataNode)
● The tests are run as a MapReduce job using 1:1 mapping (1 map / file)
● Helpful to discover performance bottlenecks in your network
● Run the write test first, then follow it with the read test
● Use -write for write tests and -read for read tests.
● The results are stored in TestDFSIO_results.log. Use -resFile to choose a different file
name
Hive Testbench
● Based on TPC-H and TPC-DS benchmarks
● Experiment with Apache Hive at any data scale
● Contains a data generator and a set of queries
● Tests basic Hive performance on large data sets
https://wiki.linaro.org/LEG/Engineering/BigData/HiveTestBench
MRBench (MapReduce Benchmark)
● Loops a small job a number of times
● Checks whether small job runs are responsive and running efficiently on your
cluster
● Puts focus on MapReduce layer as its impact on the HDFS layer is very limited
● The multiple parallel MRBench issue is resolved. Hence you can run it from
different boxes
● Test command to run 50 small test jobs
$ hadoop jar hadoop-*test*.jar mrbench -numRuns 50
● Example output, showing an average job time of about 31 seconds
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 31414
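To track MRBench results over several runs, the summary can be parsed into named fields. The Python sketch below assumes the summary is the last two non-empty lines of the output and has the column layout shown above. The helper name parse_mrbench and the added AvgTimeSeconds key are illustrative, not part of MRBench.

```python
def parse_mrbench(output: str) -> dict:
    """Parse MRBench's two-line summary and add the average time in seconds."""
    lines = [line.split() for line in output.strip().splitlines() if line.strip()]
    header, values = lines[-2], lines[-1]
    # The header ends with the unit "(milliseconds)", which is not a column name.
    names = [name for name in header if not name.startswith("(")]
    result = dict(zip(names, (int(v) for v in values)))
    result["AvgTimeSeconds"] = result["AvgTime"] / 1000.0
    return result


# The sample output shown on the slide.
SAMPLE = """DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 31414"""

print(parse_mrbench(SAMPLE))
```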
NNBench and NNBenchWithoutMR
● Load testing NameNode through continuous read, write, rename and delete
operations on small files
● Stress tests HDFS (I/O)
● To increase stress, multiple instances of NNBenchWithoutMR can be run
simultaneously from several machines or increase map tasks for NNBench
● All write tests are run first, followed by the read tests
● The test command: the command below runs a NameNode benchmark that
creates 1000 files using 12 maps and 6 reducers.
$ hadoop jar hadoop-*test*.jar nnbench -operation create_write \
-maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 \
-replicationFactorPerFile 3 -readFileAfterOpen true \
-baseDir /benchmarks/NNBench-`hostname -s`
TPC Benchmark
● TPCx-HS - https://wiki.linaro.org/LEG/Engineering/BigData/TPCxHS
○ Currently facing problems with cluster shell configuration
● TPC-H
○ TPC-H benchmark focuses on ad-hoc queries
● TPC-DS
○ “the” standard benchmark for decision support
● TPC-C
○ Is an on-line transaction processing (OLTP) benchmark
TPCx-HS Benchmark
X: Express, H: Hadoop, S: Sort
The TPCx-HS kit contains
● TPCx-HS specification documentation
● TPCx-HS User's guide documentation
● Scripts to run benchmarks
● Java code to execute the benchmark load
TPCx-HS Execution
● A valid run consists of 5 separate phases run sequentially, with no overlap in their execution
● The benchmark test consists of 2 runs (the run with the lower and the run with the higher TPCx-HS Performance Metric)
● No configuration or tuning changes or reboot are allowed between the two runs
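As I understand the TPCx-HS specification, the performance metric is HSph@SF = SF / (T / 3600), where SF is the scale factor and T is the total elapsed time of a run in seconds, and the run with the lower metric is the one reported as the performance run. The Python sketch below only illustrates that arithmetic. The function names and timings are hypothetical, and the current specification remains the authority.

```python
def hsph(scale_factor: float, elapsed_seconds: float) -> float:
    """HSph@SF = SF / (T / 3600), with T the total elapsed time of a run in seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return scale_factor / (elapsed_seconds / 3600.0)


def reported_metric(scale_factor: float, run_seconds: list) -> float:
    """Of the two runs, take the lower metric as the reported one."""
    return min(hsph(scale_factor, t) for t in run_seconds)


# Hypothetical timings for two runs at scale factor 1.
print(reported_metric(1, [7200.0, 6000.0]))
```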
TPC vs SPEC models
TPC model
● Specification based
● Performance, Price, energy in one
benchmark
● End-to-End
● Multiple tests (ACID, Load)
● Independent Review
● Full disclosure
● TPC Technology conference
SPEC model
● Kit based
● Performance and energy in
separate benchmarks
● Server centric
● Single test
● Summary disclosure
● SPEC research group ICPE
BigBench
● BigBench is a joint effort with partners in industry and academia on creating a comprehensive
and standardized BigData benchmark.
● BigBench builds upon and borrows elements from existing benchmarking efforts (such as
TPCx-HS, GridMix, PigMix, HiBench, Big Data Benchmark, YCSB and TPC-DS).
● BigBench is a specification-based benchmark with an open-source reference implementation
kit.
● As a specification-based benchmark, it would be technology-agnostic and provide the
necessary formalism and flexibility to support multiple implementations.
● Focused around execution time calculation
● Consists of 30 queries/workloads (10 of them are from TPC)
● Drawback - it is structured-data-intensive
Spark Bench for Apache Spark
● Build on ARM works
● FAIL: When the Spark-Bench examples are run, a KILL signal is observed that
terminates all workers.
● This is still under investigation, as there are no useful logs to debug. The missing
error description and the lack of documentation are a challenge.
● A ticket has already been filed on the Spark-Bench Git repository and remains unresolved.
● Con: Lack of documentation.
GridMix
● Mix of synthetic MapReduce jobs (sorting text data and SequenceFiles)
● Evaluate MapReduce and HDFS performance
● The input file needs to be in JSON format
● Jobs can be either LOADJOB (trace of history logs using Rumen) or SLEEPJOB (A synthetic job where
each task does *nothing* but sleep for a certain duration)
● Jobs can be run in STRESS, REPLAY or SERIAL mode
● You can emulate number of users, number of job queries and resource usage (CPU, memory, JVM
heap)
● Basic command line usage: (Provided as part of hadoop command)
$ hadoop gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
● Con: Challenging to explore the performance impact of combining or separating workloads, e.g.,
through consolidating from many clusters.
PigMix
● PigMix is a set of queries used to test Apache Pig performance
● There are queries that test latency (How long does it take to run this query?)
● Queries that test scalability (How many fields or records can Pig handle before
it fails?)
● Usage: Run the below commands from pig home
ant -Dharness.hadoop.home=$HADOOP_HOME pigmix-deploy (generate test dataset)
ant -Dharness.hadoop.home=$HADOOP_HOME pigmix (run the PigMix benchmark)
SWIM(Statistical Workload Injector for MapReduce)
● Enables rigorous performance measurement of MapReduce systems
● Contains suites of workloads of thousands of jobs, with complex data, arrival,
and computation patterns
● Informs highly targeted, workload-specific optimizations
● Highly recommended for MapReduce operators
● Performance measurement
https://github.com/SWIMProjectUCB/SWIM/wiki/Performance-measurement-by-executing-synthetic-or-historical-workloads
AmpLab
● The Big Data Benchmark from AMPLab, UC Berkeley provides quantitative and qualitative
comparisons of five systems
○ Redshift – a hosted MPP database offered by Amazon.com based on the ParAccel data
warehouse
○ Hive – a Hadoop-based data warehousing system
○ Shark – a Hive-compatible SQL engine which runs on top of the Spark computing framework
○ Impala – a Hive-compatible* SQL engine with its own MPP-like execution engine
○ Stinger/Tez – Tez is a next-generation Hadoop execution engine used by Hive
● This benchmark measures response time on a handful of relational queries: scans, aggregations, joins,
and UDFs, across different data sizes.
BigDataBench
BigDataBench is a benchmark suite for scale-out workloads, different from SPEC
CPU (sequential workloads), and PARSEC (multithreaded workloads). Currently, it
simulates five typical and important big data applications: search engine, social
network, e-commerce, multimedia data analytics, and bioinformatics.
Includes 15 real-world data sets, and 34 big data workloads.
References
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.pdf
Terasort, TestDFSIO, NNBench, MRBench
https://wiki.linaro.org/LEG/Engineering/BigData
https://wiki.linaro.org/LEG/Engineering/BigData/HadoopTuningGuide
https://wiki.linaro.org/LEG/Engineering/BigData/HadoopBuildInstallAndRunGuide
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
GridMix3, PigMix, HiBench, TPCx-HS, SWIM, AMPLab, BigBench
https://hadoop.apache.org/docs/current/hadoop-gridmix/GridMix.html
https://cwiki.apache.org/confluence/display/PIG/PigMix
https://wiki.linaro.org/LEG/Engineering/BigData/HiBench
https://wiki.linaro.org/LEG/Engineering/BigData/TPCxHS
https://github.com/SWIMProjectUCB/SWIM/wiki
https://github.com/amplab
https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
Thank you
ganesh.raju@linaro.org
naresh.bhat@linaro.org
#LAS16
For further information: www.linaro.org
LAS16 keynotes and videos on: connect.linaro.org

More Related Content

What's hot

A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
J On The Beach
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
Carol McDonald
 
MapReduce and NoSQL
MapReduce and NoSQLMapReduce and NoSQL
MapReduce and NoSQL
Aaron Cordova
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
inside-BigData.com
 
Lambda Architecture using Google Cloud plus Apps
Lambda Architecture using Google Cloud plus AppsLambda Architecture using Google Cloud plus Apps
Lambda Architecture using Google Cloud plus Apps
Simon Su
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
Carol McDonald
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
t_ivanov
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
Carol McDonald
 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
Glenn K. Lockwood
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
Data Works MD
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
Nicola Cadenelli
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
Yahoo Developer Network
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
Muhaza Liebenlito
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
Yahoo Developer Network
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUsOptimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Chris Fregly
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
Linaro
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
David Groozman
 

What's hot (20)

A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
MapReduce and NoSQL
MapReduce and NoSQLMapReduce and NoSQL
MapReduce and NoSQL
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
Lambda Architecture using Google Cloud plus Apps
Lambda Architecture using Google Cloud plus AppsLambda Architecture using Google Cloud plus Apps
Lambda Architecture using Google Cloud plus Apps
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUsOptimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 

Similar to Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016

Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
bigdatagurus_meetup
 
Ml2
Ml2Ml2
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Alex Zeltov
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
DataWorks Summit
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Data Con LA
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
InMobi Technology
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Carol McDonald
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Paco Nathan
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
Hortonworks
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
spinningmatt
 
Deadline-aware MapReduce Job Scheduling with Dynamic Resource Availability
Deadline-aware MapReduce Job Scheduling with Dynamic Resource AvailabilityDeadline-aware MapReduce Job Scheduling with Dynamic Resource Availability
Deadline-aware MapReduce Job Scheduling with Dynamic Resource Availability
JAYAPRAKASH JPINFOTECH
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
Data Con LA
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
Geoffrey Fox
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
Geoffrey Fox
 

Similar to Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016 (20)

Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Ml2
Ml2Ml2
Ml2
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016

  • 1. Demo - Smart City Use-case
    Using ODPi Hadoop, Spark, H2O and Sparkling Water
    Ganesh Raju
  • 2. ODPi
    ● Simplifies and standardizes the big data ecosystem with a common reference specification and test suites.
    ● Reduces cost and complexity and accelerates the development of big data solutions.
    ● Provides cross-compatibility between different distributions of Hadoop and big data technologies.
    ● Has two stacks: Runtime and Operations.
    ● V2.0 alpha release coming soon.
    ● Linaro is a member of ODPi.
    www.odpi.org
  • 3. Spark
    ● Distributed, fast in-memory data processing engine.
    ● Provides development APIs to efficiently execute iterative streaming, machine learning or SQL workloads.
    ● Spark was developed as an alternative to MapReduce, with ease of use in mind.
    ● Code in Java, Scala, or Python.
  • 4. H2O Sparkling Water
    ● H2O is an in-memory, user-friendly machine learning API.
    ● Compatible with Hadoop and Spark.
    ● Spark + H2O is Sparkling Water.
    ● Sparkling Water combines the fast, scalable machine learning algorithms of H2O with the high-performance distributed processing capabilities of the Spark engine.
    ● Spark's RDD and DataFrame and H2O's H2OFrame are interoperable.
    ● Users can use the H2O Flow UI to drive Scala / R / Python computation from Spark.
  • 5. Demo
    ● Uses ODPi v1 based native Hadoop, Spark, H2O Sparkling Water and H2O Flow.
    ● All compiled on ARM: ODPi Hadoop 2.7, Spark 1.6 with Scala 2.10 (Scala 2.11 is not supported with Sparkling Water).
    ● 3-node cluster running on Linaro Developer Cloud (HP Moonshot machines).
    ● Dataset files are stored in HDFS.
    ● Spark uses YARN as the resource manager.
    ● H2O Sparkling Water uses Spark as the execution engine.
    ● H2O Flow uses the Spark SQL API and Scala code.
    ● Data flow: .csv data -> HDFS -> Spark RDD -> H2O H2OFrame
    https://wiki.linaro.org/LEG/Engineering/BigData
  • 6. Benchmarking Big Data
    Ganesh Raju and Naresh Bhat
  • 7. Abstract
    ● Various benchmarking tools
    ● Types of benchmarks and standards
    ● Challenges of big data benchmarking on ARM
    ● Tools covered include the TPC (Transaction Processing Performance Council) benchmarks TPCx-HS, TPC-DS and TPC-H; HiBench (TestDFSIO); Spark-Bench for Apache Spark; MRBench for MapReduce; and NNBench for HDFS.
  • 8. Why Benchmarking?
    ● Measure performance and scale
    ● Simulate higher load
      ○ Find bottlenecks/limits
    ● Evaluate different hardware/software
      ○ OS, Java, VM
      ○ Hadoop, Spark, Pig, Hive, ...
    ● Validate reliability
    ● Validate assumptions / configurations
    ● Compare two different deployments
    ● Performance tuning
  • 9. Challenges of Big Data benchmarking
    ● System diversity
      ○ Variety of solutions: data read, I/O, streaming, data warehousing, machine learning
    ● Rapid data evolution (velocity)
    ● System and data scale
    ● System complexity
      ○ Multiple pipelines (layers of transformations)
  • 10. Types of benchmarks and standards
    ● Micro benchmarks: evaluate specific lower-level system operations
      ○ E.g. Hadoop workload examples (sort, grep, wordcount and TeraSort, GridMix, PigMix), HiBench, HDFS DFSIO, AMPLab Big Data Benchmark
    ● Functional/component benchmarks: specific to a low-level function
      ○ E.g. basic SQL queries (select, join, etc.)
      ○ Synthetic benchmarks
    ● Application level
      ○ BigBench
      ○ Spark-Bench
  • 11. Benchmark Efforts - Microbenchmarks
    ● HiBench
      ○ Workloads: Sort, WordCount, TeraSort, PageRank, K-means, Bayes classification, Index
      ○ Software stacks: Hadoop and Hive
      ○ Metrics: execution time, throughput, resource utilization
    ● DFSIO
      ○ Workloads: generate, read, write, append, and remove data for MapReduce jobs
      ○ Software stacks: Hadoop
      ○ Metrics: execution time, throughput
    ● AMPLab benchmark
      ○ Workloads: part of the CALDA workloads (scan, aggregate and join) and PageRank
      ○ Software stacks: Hive, Tez
      ○ Metrics: execution time
  • 12. Benchmark Efforts - TPC
    ● TPCx-HS
      ○ Workloads: HSGen, HSDataCheck, HSSort and HSValidate
      ○ Software stacks: Hadoop
      ○ Metrics: performance, price and energy
    ● TPC-H
      ○ Workloads: data warehousing operations
      ○ Software stacks: Hive, Pig
      ○ Metrics: execution time, throughput
    ● TPC-DS
      ○ Workloads: decision support benchmark (data loading, queries and maintenance)
      ○ Software stacks: Hive, Pig
      ○ Metrics: execution time, throughput
  • 13. Benchmark Efforts - Synthetic
    ● SWIM
      ○ Workloads: synthetic user-generated MapReduce jobs of reading, writing, shuffling and sorting
      ○ Software stacks: Hadoop
      ○ Metrics: multiple metrics
    ● GridMix
      ○ Workloads: synthetic and basic operations to stress test the job scheduler, compression and decompression
      ○ Software stacks: Hadoop
      ○ Metrics: memory, execution time, throughput
    ● PigMix
      ○ Workloads: 17 Pig-specific queries
      ○ Software stacks: Hadoop, Pig
      ○ Metrics: execution time
    ● MRBench
      ○ Workloads: MapReduce benchmark complementary to TeraSort; data warehouse operations with 22 TPC-H queries
      ○ Software stacks: Hadoop
      ○ Metrics: execution time
    ● NNBench and NNBenchWithoutMR
      ○ Workloads: load testing the NameNode and HDFS I/O with small payloads
      ○ Software stacks: Hadoop
      ○ Metrics: I/O
    ● SparkBench
      ○ Workloads: CPU-, memory-, shuffle- and I/O-intensive workloads; machine learning, streaming, graph computation and SQL workloads
      ○ Software stacks: Spark
      ○ Metrics: execution time, data process rate
    ● BigBench
      ○ Workloads: interactive queries based on synthetic data
      ○ Software stacks: Hadoop, Spark
      ○ Metrics: execution time
  • 14. Benchmark Efforts
    ● BigDataBench
      ○ Workloads:
        1. Micro benchmarks (sort, grep, WordCount)
        2. Search engine workloads (index, PageRank)
        3. Social network workloads (connected components (CC), K-means and BFS)
        4. E-commerce site workloads (relational database queries (select, aggregate and join), collaborative filtering (CF) and Naive Bayes)
        5. Multimedia analytics workloads (speech recognition, ray tracing, image segmentation, face detection)
        6. Bioinformatics workloads
      ○ Software stacks: Hadoop, DBMSs, NoSQL systems, Hive, Impala, HBase, MPI, Libc, and other real-time analytics systems
      ○ Metrics: throughput, memory, CPU (MIPS, MPKI: misses per kilo-instruction)
  • 15. Hadoop benchmark and test tools
    ● The Hadoop distribution comes with a number of benchmarks.
    ● TestDFSIO, nnbench and mrbench are in hadoop-*test*.jar.
    ● TeraGen, TeraSort and TeraValidate are in hadoop-*examples*.jar.
    ● You can list them with the commands:
      $ cd /usr/local/hadoop
      $ bin/hadoop jar hadoop-*test*.jar
      $ bin/hadoop jar hadoop-*examples*.jar
    ● While running the benchmarks you might want to use the time command, which measures the elapsed time. This saves you the hassle of navigating to the Hadoop JobTracker interface. The relevant metric is the real value in the first row.
      $ time hadoop jar hadoop-*examples*.jar ...
      [...]
      real 9m15.510s
      user 0m7.075s
      sys 0m0.584s
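When benchmark runs are scripted, the elapsed time reported by time can be pulled out of the captured output rather than read by eye. The following is a minimal sketch in Python, assuming the bash-style "real 9m15.510s" format shown on the slide; the function name parse_time_output is hypothetical and not part of Hadoop.

```python
import re

# Matches lines such as "real 9m15.510s" as printed by the shell's time keyword.
# Assumption: bash-style "<label> <minutes>m<seconds>s" formatting.
_TIME_LINE = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.]+)s$")


def parse_time_output(text):
    """Return a dict mapping 'real', 'user' and 'sys' to seconds."""
    result = {}
    for line in text.splitlines():
        match = _TIME_LINE.match(line.strip())
        if match:
            label, minutes, seconds = match.groups()
            result[label] = int(minutes) * 60 + float(seconds)
    return result


# The sample values are the ones quoted on the slide.
sample = """real 9m15.510s
user 0m7.075s
sys 0m0.584s"""

timings = parse_time_output(sample)
print(f"elapsed wall-clock time: {timings['real']:.3f} s")
```

Only the real row reflects wall-clock job duration; user and sys cover the local client process, not the work done on the cluster.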
  • 16. TeraGen, TeraSort and TeraValidate
    ● This is the most well-known Hadoop benchmark.
    ● The goal of TeraSort is to sort the data as fast as possible.
    ● This test suite exercises both the HDFS and MapReduce layers of a Hadoop cluster.
    ● The TeraSort benchmark consists of 3 steps:
      ○ Generate input via TeraGen
      ○ Run TeraSort on the input data
      ○ Validate the sorted output data via TeraValidate
    https://wiki.linaro.org/LEG/Engineering/BigData/HadoopBuildInstallAndRunGuide
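The three TeraSort steps can be sketched as a small Python helper that builds the command lines. This is an illustration under assumptions: the jar path and the HDFS directory names are hypothetical placeholders, and the only TeraGen detail relied on is that it takes a row count where each row is 100 bytes.

```python
import shlex

ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def terasort_commands(size_bytes, base_dir, examples_jar):
    """Build the TeraGen, TeraSort and TeraValidate command lines, in order."""
    rows = size_bytes // ROW_BYTES
    input_dir = f"{base_dir}/terasort-input"      # hypothetical directory name
    output_dir = f"{base_dir}/terasort-output"    # hypothetical directory name
    report_dir = f"{base_dir}/terasort-validate"  # hypothetical directory name
    prefix = ["hadoop", "jar", examples_jar]
    return [
        prefix + ["teragen", str(rows), input_dir],
        prefix + ["terasort", input_dir, output_dir],
        prefix + ["teravalidate", output_dir, report_dir],
    ]


# Example: a 1 GB data set. The jar name is a placeholder; substitute the
# examples jar shipped with your distribution.
commands = terasort_commands(10**9, "/benchmarks", "hadoop-examples.jar")
for argv in commands:
    print(shlex.join(argv))
```

Each step reads the previous step's output directory, so the commands have to run sequentially; wrapping each in time gives per-phase elapsed times.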
  • 17. HiBench
    ● Contains 9 typical Hadoop and Spark workloads (including micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks using Mahout, and data analytics benchmarks).
    ● Sort, WordCount, TeraSort, TestDFSIO, Nutch indexing (search indexing using the Nutch engine), PageRank (an implementation of Google's web page ranking algorithm), hivebench.
    ● Uses zlib compression for input and output.
    ● Metrics: time (sec) and throughput (bytes/sec), memory partitions, parallelism.
    ● Cons: lack of AArch64 bits, lack of documentation.
    https://wiki.linaro.org/LEG/Engineering/BigData/HiBench
  • 18. TestDFSIO
    ● It is part of hadoop-mapreduce-client-jobclient.jar.
    ● Stress tests I/O performance (throughput and latency) on a clustered setup.
    ● This test will shake out the hardware, OS and Hadoop setup on your cluster machines (NameNode/DataNode).
    ● The tests run as a MapReduce job using a 1:1 mapping (1 map per file).
    ● Helpful for discovering performance bottlenecks in your network.
    ● Run the write test first, then follow up with the read test.
    ● Use -write for write tests and -read for read tests.
    ● The results are stored in TestDFSIO_results.log. Use -resFile to choose a different file name.
  • 19. Hive Testbench
    ● Based on the TPC-H and TPC-DS benchmarks.
    ● Experiment with Apache Hive at any data scale.
    ● Contains a data generator and a set of queries.
    ● Tests basic Hive performance on large data sets.
    https://wiki.linaro.org/LEG/Engineering/BigData/HiveTestBench
  • 20. MRBench (MapReduce benchmark)
    ● Loops a small job a number of times.
    ● Checks whether small job runs are responsive and run efficiently on your cluster.
    ● Focuses on the MapReduce layer; its impact on the HDFS layer is very limited.
    ● The issue with running multiple MRBench instances in parallel is resolved, so you can run it from different boxes.
    ● Test command to run 50 small test jobs:
      $ hadoop jar hadoop-*test*.jar mrbench -numRuns 50
    ● Example output, which shows that a job finished in about 31 seconds on average:
      DataLines Maps Reduces AvgTime (milliseconds)
      1 2 1 31414
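The MRBench summary can be turned into seconds with a few lines of Python. This is a sketch that assumes the two-line header/value layout quoted on the slide; the function name parse_mrbench is hypothetical.

```python
def parse_mrbench(summary):
    """Parse the MRBench summary: a header row followed by one row of values."""
    lines = [line for line in summary.strip().splitlines() if line.strip()]
    values = [int(token) for token in lines[-1].split()]
    data_lines, maps, reduces, avg_ms = values
    return {
        "data_lines": data_lines,
        "maps": maps,
        "reduces": reduces,
        "avg_time_ms": avg_ms,
        "avg_time_s": avg_ms / 1000.0,  # AvgTime is reported in milliseconds
    }


# The sample is the output quoted on the slide.
sample = """DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 31414"""

stats = parse_mrbench(sample)
print(f"average job time: {stats['avg_time_s']:.1f} s")
```

AvgTime is the mean over all runs requested with -numRuns, so for 50 runs the total wall-clock time is roughly 50 times this figure.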
  • 21. NNBench and NNBenchWithoutMR
    ● Load tests the NameNode through continuous read, write, rename and delete operations on small files.
    ● Stress tests HDFS (I/O).
    ● To increase stress, run multiple instances of NNBenchWithoutMR simultaneously from several machines, or increase the number of map tasks for NNBench.
    ● All write tests are run first, followed by the read tests.
    ● Test command: the command below runs a NameNode benchmark that creates 1000 files using 12 maps and 6 reducers.
      $ hadoop jar hadoop-*test*.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname -s`
  • 22. TPC Benchmark
    ● TPCx-HS - https://wiki.linaro.org/LEG/Engineering/BigData/TPCxHS
      ○ Currently facing problems with cluster shell configuration
    ● TPC-H
      ○ The TPC-H benchmark focuses on ad-hoc queries
    ● TPC-DS
      ○ "The" standard benchmark for decision support
    ● TPC-C
      ○ An on-line transaction processing (OLTP) benchmark
  • 23. TPCx-HS Benchmark
    X: Express, H: Hadoop, S: Sort
    The TPCx-HS kit contains:
    ● TPCx-HS specification documentation
    ● TPCx-HS user's guide documentation
    ● Scripts to run the benchmarks
    ● Java code to execute the benchmark load
    TPCx-HS execution:
    ● A valid run consists of 5 separate phases run sequentially, with no overlap in their execution.
    ● The benchmark test consists of 2 runs (the runs with the lower and the higher TPCx-HS Performance Metric).
    ● No configuration or tuning changes or reboots are allowed between the two runs.
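How the two runs relate to the reported figure can be illustrated with a short Python sketch. It assumes, as I understand the TPCx-HS specification, that the performance metric is HSph@SF = SF / (T / 3600), with SF the scale factor and T the total elapsed time of a run in seconds, and that the run with the lower metric is the one reported. The function names and the sample timings are hypothetical; consult the specification for the authoritative definitions.

```python
def hsph(scale_factor, elapsed_seconds):
    """HSph@SF = SF / (T / 3600): scale factor processed per hour (assumed formula)."""
    return scale_factor / (elapsed_seconds / 3600.0)


def reported_metric(scale_factor, run_times_seconds):
    """Return the lower metric of the two runs (assumed to be the reported one)."""
    first, second = (hsph(scale_factor, t) for t in run_times_seconds)
    return min(first, second)


# Hypothetical timings for two back-to-back runs at scale factor 1.
runs = (7200.0, 6000.0)
print(f"run 1: {hsph(1, runs[0]):.3f} HSph@1")
print(f"run 2: {hsph(1, runs[1]):.3f} HSph@1")
print(f"reported: {reported_metric(1, runs):.3f} HSph@1")
```

Reporting the slower of two untuned, back-to-back runs is what makes the ban on configuration changes or reboots between them meaningful: the second run serves as a repeatability check.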
  • 24. TPC vs SPEC models
    TPC model:
    ● Specification based
    ● Performance, price and energy in one benchmark
    ● End-to-end
    ● Multiple tests (ACID, load)
    ● Independent review
    ● Full disclosure
    ● TPC Technology Conference
    SPEC model:
    ● Kit based
    ● Performance and energy in separate benchmarks
    ● Server centric
    ● Single test
    ● Summary disclosure
    ● SPEC Research Group, ICPE
  • 25. BigBench
    ● BigBench is a joint effort with partners in industry and academia to create a comprehensive and standardized big data benchmark.
    ● BigBench builds upon and borrows elements from existing benchmarking efforts (such as TPCx-HS, GridMix, PigMix, HiBench, Big Data Benchmark, YCSB and TPC-DS).
    ● BigBench is a specification-based benchmark with an open-source reference implementation kit.
    ● As a specification-based benchmark, it is technology-agnostic and provides the formalism and flexibility needed to support multiple implementations.
    ● Focused on execution time.
    ● Consists of 30 queries/workloads (10 of them are from TPC).
    ● Drawback: it is structured-data-intensive.
  • 26. Spark Bench for Apache Spark
    ● The build on ARM works.
    ● FAIL: when the Spark Bench examples are run, a KILL signal is observed that terminates all workers.
    ● This is still under investigation: there are no useful logs to debug, no proper error description, and the lack of documentation is a challenge.
    ● A ticket has been filed on the Spark Bench Git repository and is still unresolved.
    ● Con: lack of documentation.
  • 27. GridMix
    ● A mix of synthetic MapReduce jobs (sorting text data and SequenceFiles).
    ● Evaluates MapReduce and HDFS performance.
    ● The input file needs to be in JSON format.
    ● Jobs can be either LOADJOB (a trace of history logs produced with Rumen) or SLEEPJOB (a synthetic job where each task does *nothing* but sleep for a certain duration).
    ● Jobs can be run in STRESS, REPLAY or SERIAL mode.
    ● You can emulate the number of users, the number of job queues and resource usage (CPU, memory, JVM heap).
    ● Basic command line usage (provided as part of the hadoop command):
      $ hadoop gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
    ● Con: it is challenging to explore the performance impact of combining or separating workloads, e.g., when consolidating from many clusters.
  • 28. PigMix
    ● PigMix is a set of queries used to test Apache Pig performance.
    ● Some queries test latency (how long does it take to run this query?).
    ● Some queries test scalability (how many fields or records can Pig handle before it fails?).
    ● Usage: run the commands below from the Pig home directory.
      ant -Dharness.hadoop.home=$HADOOP_HOME pigmix-deploy (generates the test dataset)
      ant -Dharness.hadoop.home=$HADOOP_HOME pigmix (runs the PigMix benchmark)
  • 29. SWIM (Statistical Workload Injector for MapReduce)
    ● Enables rigorous performance measurement of MapReduce systems.
    ● Contains suites of workloads of thousands of jobs, with complex data, arrival, and computation patterns.
    ● Informs highly targeted, workload-specific optimizations.
    ● Highly recommended for MapReduce operators.
    ● Performance measurement: https://github.com/SWIMProjectUCB/SWIM/wiki/Performance-measurement-by-executing-synthetic-or-historical-workloads
  • 30. AMPLab
    ● The Big Data Benchmark from AMPLab, UC Berkeley provides quantitative and qualitative comparisons of five systems:
      ○ Redshift: a hosted MPP database offered by Amazon.com, based on the ParAccel data warehouse
      ○ Hive: a Hadoop-based data warehousing system
      ○ Shark: a Hive-compatible SQL engine that runs on top of the Spark computing framework
      ○ Impala: a Hive-compatible SQL engine with its own MPP-like execution engine
      ○ Stinger/Tez: Tez is a next-generation Hadoop execution engine, used by Hive through the Stinger initiative
    ● This benchmark measures response time on a handful of relational queries (scans, aggregations, joins, and UDFs) across different data sizes.
  • 31. BigDataBench
    ● BigDataBench is a benchmark suite for scale-out workloads, different from SPEC CPU (sequential workloads) and PARSEC (multithreaded workloads).
    ● Currently, it simulates five typical and important big data applications: search engine, social network, e-commerce, multimedia data analytics, and bioinformatics.
    ● Includes 15 real-world data sets and 34 big data workloads.
  • 32. References
    https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.pdf
    TeraSort, TestDFSIO, NNBench, MRBench:
    https://wiki.linaro.org/LEG/Engineering/BigData
    https://wiki.linaro.org/LEG/Engineering/BigData/HadoopTuningGuide
    https://wiki.linaro.org/LEG/Engineering/BigData/HadoopBuildInstallAndRunGuide
    http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
    GridMix3, PigMix, HiBench, TPCx-HS, SWIM, AMPLab, BigBench:
    https://hadoop.apache.org/docs/current/hadoop-gridmix/GridMix.html
    https://cwiki.apache.org/confluence/display/PIG/PigMix
    https://wiki.linaro.org/LEG/Engineering/BigData/HiBench
    https://wiki.linaro.org/LEG/Engineering/BigData/TPCxHS
    https://github.com/SWIMProjectUCB/SWIM/wiki
    https://github.com/amplab
    https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
  • 33. Thank you
    ganesh.raju@linaro.org
    naresh.bhat@linaro.org
    #LAS16
    For further information: www.linaro.org
    LAS16 keynotes and videos on: connect.linaro.org