SlideShare a Scribd company logo
1 of 18
xPatterns … beyond Hadoop
Seattle Scalability Meetup
March 2014
2
• xPatterns Architecture
• xPatterns Infrastructure: From Hadoop & Hive to Spark & Shark & Mesos & Tachyon
• ELT pipeline rebuilt on BDAS
• xPatterns Jaws SharkServer & GUI (Demo)
• Export to NoSql API tool (Demo)
• xPatterns dashboard application (Demo)
• xPatterns monitoring and instrumentation (Demo)
Content
3
4
5
• Hadoop -> Spark: faster distributed computing engine leveraging in-memory computation at a much lower
operational cost, machine learning primitives, simpler programming model (Scala, Python, Java), faster job
submission, shell for quick prototyping and testing, ideal for our iterative algorithms
• Hive -> Shark: interactive queries on large datasets have become reasonable requests (in-memory caching
yields 4-20x performance improvement, ELT script base migration required minimal effort (same familiar
HiveQL, with a few exceptions)
• NO resource manager - > Mesos: multiple workloads from multiple frameworks can co-exist and fairly
consume the cluster resources (policy based). More mature than YARN, allows us to separate production
from experimentation workloads, co-locates legacy Hadoop MR jobs, multiple Shark servers (Jaws), multiple
Spark Job servers, mixed Hive and Shark queries (ELT), and establish priority queues: no more
unmanageable contention and delayed execution while maximizing cluster utilization (dynamic scheduling)
• No Cache -> Tachyon: in-memory distributed file system, with HDFS backup, resilience through lineage
rather than replication, our out-of-process cache that survives Spark JVM restarts, allows for fine tuning
performance and experimenting against cached warehouse tables without reload. Faster than in process
cache due to delayed GC. Provides data sharing between multiple Spark/Shark jobs, efficient in-memory
columnar storage with compression support for minimal footprint
• Cloudera Manager Dashboards-> Ganglia: distributed monitoring system for dashboards with historical
metrics data (CPU, RAM, disk I/O, network I/O) and Spark/Hadoop metrics. This is a nice addition to our
Nagios (monitoring and alerts) and Graphite (instrumentation dashboards)
xPatterns Infrastructure Evolution
6
• 20 billion healthcare records, 200 TB of compressed hdfs data
• Processing pipeline, a mixture of custom MR and mostly Hive scripts, converted to Spark and
Shark, with performance gains of 3-4x (for disk intensive operations) to 60x for queries on
cached tables (Spark cache or Tachyon which is slightly faster with added resilience benefits)
• Daily processing reduced from 14 hours to 1.5hours!
• Shark 0.8.1 does not support: map join auto-conversion, automatic calculation of number of
reducers, reducer or map out phase disk spills, skew joins etc … we have to either manually
fine tune the cluster and the query based on the specific dataset, or we are better off with
Hive under these circumstances … so we use Mesos to manage Hadoop and Spark under the
same cluster, mixing Hive and Shark workloads (demo)
• 0.9.0 fixes many of the problems, but still requires patches!
• Tested against multiple cluster configurations of the same cost, using 3 types of instances:
m1.xlarge (4c x 15GB), m2.4xlarge (8c x 68.4GB) and cc2.8xlarge (32c x 60.8GB).
• Jaws config settings explained: set mapreduce.job.reduces=…, set
shark.column.compress=true, spark.default.parallelism=384,
spark.storage.memoryFraction=0.3, spark.shuffle.memoryFraction=0.6,
spark.shuffle.consolidateFiles=true, spark.shuffle.spill=false|true, spark.mesos.coarse=false,
spark.scheduler.mode=FAIR
ELT processing and data quality pipeline
7
• Jaws: a highly scalable and resilient restful (http) interface on top of a managed Shark session
that can concurrently and asynchronously submit Shark queries, return persisted results
(automatically limited in size or paged), execution logs and job information (Cassandra or hdfs
persisted).
• Jaws can be load balanced for higher availability and scalability and it fuels a web-based GUI
that is integrated in the xPatterns Management Console (Warehouse Explorer)
• Jaws exposes configuration options for fine-tuning Spark & Shark performance and running
against a stand-alone Spark deployment, with or without Tachyon as in-memory distributed
file system on top of HDFS, and with or without Mesos as resource manager
• Provides different deployment recipes for all combinations of Spark, Mesos and Tachyon
• Shark editor provides analysts, data scientists with a view into the warehouse through a
metadata explorer, provides a query editor with intelligent features like auto-complete, a
results viewer, logs viewer and historical queries for asynchronously retrieving persisted
results, logs and query information for both running and historical queries
Jaws REST SharkServer & GUI
8
Jaws REST SharkServer & GUI
9
Export to NoSql API
• Datasets in the warehouse need to be exposed to high-throughput low-latency real-time
APIs. Each application requires extra processing performed on top of the core datasets,
hence additional transformations are executed for building data marts inside the
warehouse
• Exporter tool builds the efficient data model and runs an export of data from a Shark/Hive
table to a Cassandra Column Family, through a custom Spark job with configurable
throughput (configurable Spark processors against a Cassandra ring) (instrumentation
dashboard embedded, logs, progress and instrumentation events pushed though SSE)
• Data Modeling is driven by the read access patterns provided by an application engineer
building dashboards and visualizations: lookup key, columns (record fields to read), paging,
sorting, filtering
• The end result of a job run is a REST API endpoint (instrumented, monitored, resilient, geo-
replicated) that uses the underlying generated Cassandra data model and fuels the data in
the dashboards
• Configuration API provided for creating export jobs and executing them (ad-hoc or
scheduled).
10
11
Referral Provider Network
• One of the many applications that we built for our largest healthcare customers using
the xPatterns APIs and tools on the new upgraded infrastructure: ELT Pipeline, Jaws,
Export to NoSql API. The dashboard for the RPN application was built using D3.js and
angular against the generic api published by the export tool.
• The application allows for building a graph of downstream and upstream referred and
referring providers, grouped by specialty, with computed aggregates like patient counts,
claim counts and total charged amounts. RPN is used for both fraud detection and for
aiding a clinic buying decision, by following the busiest graph paths.
• The dataset behind the app consists of 8 billion medical records, from which we
extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in
the graph (persisted in Cassandra)
• While we demo the graph building we will also look at the Graphite instrumentation
dashboard for analyzing the runtime performance of the geo-replicated Cassandra read
operations (latency in the 20-50ms range)
12
13
Graphite – Cassandra multi DC ring
14
15
16
17
• Export to Semantic Search API (solrCloud/lucene)
• T component (data transformation and monitoring
pipeline, with Spark/Shark/Hadoop MR/Hive support)
• pySpark Job Server
• pySpark  Shark/Tachyon interop (either)
• pySpark  Spark SQL (1.0) interop (or)
• Parquet columnar storage for warehouse data
Coming soon …
18
Q & A

More Related Content

What's hot

Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Tachyon meetup San Francisco Oct 2014
Tachyon meetup San Francisco Oct 2014Tachyon meetup San Francisco Oct 2014
Tachyon meetup San Francisco Oct 2014Claudiu Barbura
 
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy StarzhinskySpark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy StarzhinskySpark Summit
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with SparkKnoldus Inc.
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1Joe Stein
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK StackKnoldus Inc.
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 

What's hot (20)

Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Tachyon meetup San Francisco Oct 2014
Tachyon meetup San Francisco Oct 2014Tachyon meetup San Francisco Oct 2014
Tachyon meetup San Francisco Oct 2014
 
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy StarzhinskySpark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
Spark Summit EU talk by Yaroslav Nedashkovsky and Andy Starzhinsky
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar Veliqi
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with Spark
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK Stack
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 

Viewers also liked

Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14Sri Ambati
 
Akka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based ApplicationsAkka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based ApplicationsNLJUG
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on AndroidTomáš Kypta
 
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsPSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsStephane Manciot
 
A Journey to Reactive Function Programming
A Journey to Reactive Function ProgrammingA Journey to Reactive Function Programming
A Journey to Reactive Function ProgrammingAhmed Soliman
 
Akka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in PracticeAkka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in PracticeRoland Kuhn
 
Docker. Does it matter for Java developer ?
Docker. Does it matter for Java developer ?Docker. Does it matter for Java developer ?
Docker. Does it matter for Java developer ?Izzet Mustafaiev
 
Reactive Streams and RabbitMQ
Reactive Streams and RabbitMQReactive Streams and RabbitMQ
Reactive Streams and RabbitMQmkiedys
 
Reactive Jersey Client
Reactive Jersey ClientReactive Jersey Client
Reactive Jersey ClientMichal Gajdos
 
Resilient Applications with Akka Persistence - Scaladays 2014
Resilient Applications with Akka Persistence - Scaladays 2014Resilient Applications with Akka Persistence - Scaladays 2014
Resilient Applications with Akka Persistence - Scaladays 2014Björn Antonsson
 
12 Factor App: Best Practices for JVM Deployment
12 Factor App: Best Practices for JVM Deployment12 Factor App: Best Practices for JVM Deployment
12 Factor App: Best Practices for JVM DeploymentJoe Kutner
 
Micro services, reactive manifesto and 12-factors
Micro services, reactive manifesto and 12-factorsMicro services, reactive manifesto and 12-factors
Micro services, reactive manifesto and 12-factorsDejan Glozic
 

Viewers also liked (12)

Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14
 
Akka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based ApplicationsAkka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based Applications
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on Android
 
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsPSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
 
A Journey to Reactive Function Programming
A Journey to Reactive Function ProgrammingA Journey to Reactive Function Programming
A Journey to Reactive Function Programming
 
Akka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in PracticeAkka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in Practice
 
Docker. Does it matter for Java developer ?
Docker. Does it matter for Java developer ?Docker. Does it matter for Java developer ?
Docker. Does it matter for Java developer ?
 
Reactive Streams and RabbitMQ
Reactive Streams and RabbitMQReactive Streams and RabbitMQ
Reactive Streams and RabbitMQ
 
Reactive Jersey Client
Reactive Jersey ClientReactive Jersey Client
Reactive Jersey Client
 
Resilient Applications with Akka Persistence - Scaladays 2014
Resilient Applications with Akka Persistence - Scaladays 2014Resilient Applications with Akka Persistence - Scaladays 2014
Resilient Applications with Akka Persistence - Scaladays 2014
 
12 Factor App: Best Practices for JVM Deployment
12 Factor App: Best Practices for JVM Deployment12 Factor App: Best Practices for JVM Deployment
12 Factor App: Best Practices for JVM Deployment
 
Micro services, reactive manifesto and 12-factors
Micro services, reactive manifesto and 12-factorsMicro services, reactive manifesto and 12-factors
Micro services, reactive manifesto and 12-factors
 

Similar to xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)

xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014Claudiu Barbura
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problemsAbhishek Gupta
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...Megha Shah
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Sparkhbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and SparkMichael Stack
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsNetajiGandi1
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Joan Novino
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 

Similar to xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon) (20)

xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Sparkhbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
BDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data AnalyticsBDA R20 21NM - Summary Big Data Analytics
BDA R20 21NM - Summary Big Data Analytics
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
Cassandra in xPatterns
Cassandra in xPatternsCassandra in xPatterns
Cassandra in xPatterns
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 

More from Claudiu Barbura

Overkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark MeetupOverkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark MeetupClaudiu Barbura
 
Autonomous analytics on streaming data
Autonomous analytics on streaming dataAutonomous analytics on streaming data
Autonomous analytics on streaming dataClaudiu Barbura
 
Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsClaudiu Barbura
 
Building an intelligent big data application in 30 minutes
Building an intelligent big data application in 30 minutesBuilding an intelligent big data application in 30 minutes
Building an intelligent big data application in 30 minutesClaudiu Barbura
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 

More from Claudiu Barbura (6)

Overkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark MeetupOverkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark Meetup
 
Overkill Analytics
Overkill AnalyticsOverkill Analytics
Overkill Analytics
 
Autonomous analytics on streaming data
Autonomous analytics on streaming dataAutonomous analytics on streaming data
Autonomous analytics on streaming data
 
Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internals
 
Building an intelligent big data application in 30 minutes
Building an intelligent big data application in 30 minutesBuilding an intelligent big data application in 30 minutes
Building an intelligent big data application in 30 minutes
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 

Recently uploaded

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)

  • 1. xPatterns … beyond Hadoop Seattle Scalability Meetup March 2014
  • 2. 2 • xPatterns Architecture • xPatterns Infrastructure: From Hadoop & Hive to Spark & Shark & Mesos & Tachyon • ELT pipeline rebuilt on BDAS • xPatterns Jaws SharkServer & GUI (Demo) • Export to NoSql API tool (Demo) • xPatterns dashboard application (Demo) • xPatterns monitoring and instrumentation (Demo) Content
  • 3. 3
  • 4. 4
  • 5. 5 • Hadoop -> Spark: faster distributed computing engine leveraging in-memory computation at a much lower operational cost, machine learning primitives, simpler programming model (Scala, Python, Java), faster job submission, shell for quick prototyping and testing, ideal for our iterative algorithms • Hive -> Shark: interactive queries on large datasets have become reasonable requests (in-memory caching yields 4-20x performance improvement, ELT script base migration required minimal effort (same familiar HiveQL, with a few exceptions) • NO resource manager - > Mesos: multiple workloads from multiple frameworks can co-exist and fairly consume the cluster resources (policy based). More mature than YARN, allows us to separate production from experimentation workloads, co-locates legacy Hadoop MR jobs, multiple Shark servers (Jaws), multiple Spark Job servers, mixed Hive and Shark queries (ELT), and establish priority queues: no more unmanageable contention and delayed execution while maximizing cluster utilization (dynamic scheduling) • No Cache -> Tachyon: in-memory distributed file system, with HDFS backup, resilience through lineage rather than replication, our out-of-process cache that survives Spark JVM restarts, allows for fine tuning performance and experimenting against cached warehouse tables without reload. Faster than in process cache due to delayed GC. Provides data sharing between multiple Spark/Shark jobs, efficient in-memory columnar storage with compression support for minimal footprint • Cloudera Manager Dashboards-> Ganglia: distributed monitoring system for dashboards with historical metrics data (CPU, RAM, disk I/O, network I/O) and Spark/Hadoop metrics. This is a nice addition to our Nagios (monitoring and alerts) and Graphite (instrumentation dashboards) xPatterns Infrastructure Evolution
  • 6. 6 • 20 billion healthcare records, 200 TB of compressed hdfs data • Processing pipeline, a mixture of custom MR and mostly Hive scripts, converted to Spark and Shark, with performance gains of 3-4x (for disk intensive operations) to 60x for queries on cached tables (Spark cache or Tachyon which is slightly faster with added resilience benefits) • Daily processing reduced from 14 hours to 1.5hours! • Shark 0.8.1 does not support: map join auto-conversion, automatic calculation of number of reducers, reducer or map out phase disk spills, skew joins etc … we have to either manually fine tune the cluster and the query based on the specific dataset, or we are better off with Hive under these circumstances … so we use Mesos to manage Hadoop and Spark under the same cluster, mixing Hive and Shark workloads (demo) • 0.9.0 fixes many of the problems, but still requires patches! • Tested against multiple cluster configurations of the same cost, using 3 types of instances: m1.xlarge (4c x 15GB), m2.4xlarge (8c x 68.4GB) and cc2.8xlarge (32c x 60.8GB). • Jaws config settings explained: set mapreduce.job.reduces=…, set shark.column.compress=true, spark.default.parallelism=384, spark.storage.memoryFraction=0.3, spark.shuffle.memoryFraction=0.6, spark.shuffle.consolidateFiles=true, spark.shuffle.spill=false|true, spark.mesos.coarse=false, spark.scheduler.mode=FAIR ELT processing and data quality pipeline
  • 7. 7 • Jaws: a highly scalable and resilient restful (http) interface on top of a managed Shark session that can concurrently and asynchronously submit Shark queries, return persisted results (automatically limited in size or paged), execution logs and job information (Cassandra or hdfs persisted). • Jaws can be load balanced for higher availability and scalability and it fuels a web-based GUI that is integrated in the xPatterns Management Console (Warehouse Explorer) • Jaws exposes configuration options for fine-tuning Spark & Shark performance and running against a stand-alone Spark deployment, with or without Tachyon as in-memory distributed file system on top of HDFS, and with or without Mesos as resource manager • Provides different deployment recipes for all combinations of Spark, Mesos and Tachyon • Shark editor provides analysts, data scientists with a view into the warehouse through a metadata explorer, provides a query editor with intelligent features like auto-complete, a results viewer, logs viewer and historical queries for asynchronously retrieving persisted results, logs and query information for both running and historical queries Jaws REST SharkServer & GUI
  • 9. 9 Export to NoSql API • Datasets in the warehouse need to be exposed to high-throughput low-latency real-time APIs. Each application requires extra processing performed on top of the core datasets, hence additional transformations are executed for building data marts inside the warehouse • Exporter tool builds the efficient data model and runs an export of data from a Shark/Hive table to a Cassandra Column Family, through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring) (instrumentation dashboard embedded, logs, progress and instrumentation events pushed though SSE) • Data Modeling is driven by the read access patterns provided by an application engineer building dashboards and visualizations: lookup key, columns (record fields to read), paging, sorting, filtering • The end result of a job run is a REST API endpoint (instrumented, monitored, resilient, geo- replicated) that uses the underlying generated Cassandra data model and fuels the data in the dashboards • Configuration API provided for creating export jobs and executing them (ad-hoc or scheduled).
  • 10. 10
  • 11. 11 Referral Provider Network • One of the many applications that we built for our largest healthcare customers using the xPatterns APIs and tools on the new upgraded infrastructure: ELT Pipeline, Jaws, Export to NoSql API. The dashboard for the RPN application was built using D3.js and angular against the generic api published by the export tool. • The application allows for building a graph of downstream and upstream referred and referring providers, grouped by specialty, with computed aggregates like patient counts, claim counts and total charged amounts. RPN is used for both fraud detection and for aiding a clinic buying decision, by following the busiest graph paths. • The dataset behind the app consists of 8 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra) • While we demo the graph building we will also look at the Graphite instrumentation dashboard for analyzing the runtime performance of the geo-replicated Cassandra read operations (latency in the 20-50ms range)
  • 12. 12
  • 13. 13 Graphite – Cassandra multi DC ring
  • 14. 14
  • 15. 15
  • 16. 16
  • 17. 17 • Export to Semantic Search API (solrCloud/lucene) • T component (data transformation and monitoring pipeline, with Spark/Shark/Hadoop MR/Hive support) • pySpark Job Server • pySpark  Shark/Tachyon interop (either) • pySpark  Spark SQL (1.0) interop (or) • Parquet columnar storage for warehouse data Coming soon …

Editor's Notes

  1. the logical architecture diagram with the 3 logical layers of xPatterns: Infrastructure, Analytics and Visualization and the roles: ELT Engineer, Data Scientist, Application Engineer.xPatterns is a big data analytics platform as a service that enables a rapid development of enterprise-grade analytical applications. It provides tools, api sets and a management console for building an ELT pipeline with data monitoring and quality gates, a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation, tools for exporting data to NoSql and solrCloud cluster for real-time access through low-latency/high-throughput apis as well as dashboard and visualization api/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications build on top of xPatterns for our largest customer for that runs xPatterns in production on top a data warehouse consisting of several hundreds TB of medical, pharmacy and lab data records consisting of tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application.”
  2. physical architecture diagram for our largest customer deployment, demonstrating the enterprise-grade attributes of the platform: scalability, high availability, performance, resilience, manageability while providing means for geo-failover (warehouse), geo-replication (real-time DB), data and system monitoring, instrumentation, backup & restoreThis is not our complete reference architecture, as it is missing solrCloud clusters (semantic search and dynamic ontologies), ZooKeeper clusters (distributed locks, synchronization, metadata and configuration management, cluster management), real time model scoring services, real time analytics stack (Kafka, Storm, node.js/spray/akka),
  3. Ganglia allows us to not necessarily include Cloudera Manager in our management console and use any Hadoopdistro for HDFS and leftover MR and Hive … Hadoop -> Spark: faster distributed computing engine leveraging in-memory computation at a much lower operational cost, machine learning primitives, simpler programming model (Scala, Python, Java), faster job submission, shell for quick prototyping and testing, ideal for our iterative algorithmsHive -> Shark: interactive queries on large datasets have become reasonable requests (in-memory caching yields 4-20x performance improvement, ELT script base migration required minimal effort (same familiar HiveQL, with a few exceptions)NO resource manager - > Mesos: multiple workloads from multiple frameworks can co-exist and fairly consume the cluster resources (policy based). More mature than YARN, allows us to separate production from experimentation workloads, co-locates Spark (ingestion pipeline & export to NoSql jobs), legacy Hadoop MR jobs, multiple Shark servers (Jaws), multiple Spark Job servers, mixed Hive and Shark queries (ELT), and establish priority queues: no more unmanageable contention and delayed execution while maximizing cluster utilizationNo Cache -> Tachyon: in-memory distributed file system, with HDFS backup, resilience through lineage rather than replication, our out-of-process cache that survives Spark JVM restarts, allows for fine tuning performance and experimenting against cached warehouse tables without reload. Faster than in process cache due to delayed GC. Provides data sharing between multiple Spark/Shark jobs, efficient in-memory columnar storage with compression support for minimal footprintCloudera Manager -> Ganglia: distributed monitoring system with a web interface for dashboards with historical metrics data (CPU, RAM, disk I/O, network I/O). This is a nice addition to our Nagios (monitoring and alerts) and Graphite (instrumentation dashboards)
  4. We converted the data ingestion tool to Spark so we can completely replace Hadoop and Hive in the future and for the faster job submission. We are converting all of our MR jobs to Spark jobs, for both ingestion and exports to CassandraDifferent dataset require different cluster configurations, biggest problem of Spark 0.8.1 is that intermediate output and reducer input is not spilled to disk, causing OOM frequently (0.9.0 solves the reducer part).
  5. Jaws: a highly scalable and resilient restful (http) interface on top of a managed Shark session that can concurrently and asynchronously submit Shark queries, return persisted results (automatically limited in size), execution logs and job information (Cassandra or hdfs persisted).Jaws can be load balanced for higher availability and scalability and it fuels a web-based GUI called Shark Editor that is integrated in the xPatterns Management ConsoleJaws exposes configuration options for fine-tuning Spark & Shark performance and running against a stand-alone Spark deployment, with or without Tachyon as in-memory distributed file system on top of HDFS, and with or without Mesos as resource manager Provides different deployment recipes for all combinations of Spark, Mesos and TachyonShark editor provides analysts, data scientists with a view into the warehouse through a metadata explorer, provides a query editor with intelligent features like auto-complete, a results viewer, logs viewer and historical queries for asynchronously retrieving persisted results, logs and query information for both running and historical queries (DEMO)
  6. Datasets in the warehouse need to be exposed to high-throughput low-latency real-time APIs. Each application requires extra processing performed on top of the core datasets, hence additional transformations are executed for building data marts inside the warehousePre-optimization Shark/Hive queries required for building an efficient data model for Cassandra persistence: minimal number of column families, wide rows (50-100 MB compressed). Resulting data model is efficient for both read (dashboard/API) and write (export/updates) requestsExporter tool builds the efficient data model and runs an export of data from a Shark/Hive table to a Cassandra Column Family, through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring)Data Modeling is driven by the read access patterns: lookup key, columns (record fields to read), paging, sorting, filtering.The data access patterns is used for automatically publishing a REST api that uses the underlying generated Cassandra data model and it fuels the data in the dashboardsExecution logs behind workflows, progress report and instrumentation events for the dashboard are pushed to the browser through SSE (Zookeeper watchers used for synchronization)
  7. Datasets in the warehouse need to be exposed to high-throughput low-latency real-time APIs. Each application requires extra processing performed on top of the core datasets, hence additional transformations are executed for building data marts inside the warehousePre-optimization Shark/Hive queries required for building an efficient data model for Cassandra persistence: minimal number of column families, wide rows (50-100 MB compressed). Resulting data model is efficient for both read (dashboard/API) and write (export/updates) requestsExporter tool builds the efficient data model and runs an export of data from a Shark/Hive table to a Cassandra Column Family, through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring)Data Modeling is driven by the read access patterns: lookup key, columns (record fields to read), paging, sorting, filtering.The data access patterns is used for automatically publishing a REST api that uses the underlying generated Cassandra data model and it fuels the data in the dashboardsExecution logs behind workflows, progress report and instrumentation events for the dashboard are pushed to the browser through SSE (Zookeeper watchers used for synchronization)
  8. Referral Provider Network: one of the 6 applications that we built for our healthcare customer using the xPatterns APIs and tools on the new beyond Hadoop infrastructure: ELT Pipeline, Export to NoSQL API. The dashboard for the RPN application was built using D3.js and angular against the generic api published by the export tool. The application allows for building a graph of downstream and upstream referred and referring providers, grouped by specialty and with computed aggregates like patient counts, claim counts and total charged amounts. RPN is used for both fraud detection and for aiding a clinic buying decision, by following the busiest graph paths.The dataset behind the app consists of 8 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra)While we demo the graph building we will also look at the Graphite instrumentation dashboard for analyzing the runtime performance of the geo-replicated Cassandra read operations during the demo 
  9. Referral Provider Network: one of the 6 applications that we built for our healthcare customer using the xPatterns APIs and tools on the new beyond Hadoop infrastructure: ELT Pipeline, Export to NoSQL API. The dashboard for the RPN application was built using D3.js and angular against the generic api published by the export tool. The application allows for building a graph of downstream and upstream referred and referring providers, grouped by specialty and with computed aggregates like patient counts, claim counts and total charged amounts. RPN is used for both fraud detection and for aiding a clinic buying decision, by following the busiest graph paths.The dataset behind the app consists of 8 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra)While we demo the graph building we will also look at the Graphite instrumentation dashboard for analyzing the runtime performance of the geo-replicated Cassandra read operations during the demo 
  10. Instrumentation dashboard showcasing the read latency measured during peak (40ms average, 60peak)