High Performance Enterprise Data Processing with Apache Spark with Sandeep Varma and Vickye Jain

Data engineering to support reporting and analytics for commercial Life Sciences groups consists of very complex, interdependent processing with highly complex business rules (thousands of transformations on hundreds of data sources). We talk about our experiences building a very high performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance. We touch upon optimizing enterprise-grade Spark architecture for data warehousing and data mart type applications, optimizing end-to-end pipelines for extreme performance, running hundreds of jobs in parallel in Spark, orchestrating across multiple Spark clusters, and some guidelines for high speed platform and application development within enterprises. Key takeaways:
• an example architecture for complex data warehousing and data mart applications on Spark
• an architecture for high performance enterprise Spark platforms that balances functionality with total cost of ownership
• orchestrating multiple elastic Spark clusters while running hundreds of jobs in parallel
• business benefits of high performance data engineering, especially for Life Sciences



1. Sandeep Varma & Vickye Jain, ZS Associates: High Performance Enterprise Data Processing with Apache Spark (#EUde3)
2. Who are we? 4,500+ professionals with deep industry and domain expertise help clients significantly improve their performance. 22 offices worldwide: Barcelona, Boston, Chicago, Evanston, Frankfurt, London, Los Angeles, Milan, New Delhi, New York, Paris, Philadelphia, Princeton, Pune, San Diego, San Francisco, São Paulo, Shanghai, Singapore, Tokyo, Toronto, Zürich. ZS is a world leader in delivering high-performance solutions for Life Sciences companies to achieve impact.
3. Business problems posed to us
• Revamp a large, complex system that is difficult to maintain and understand
• SLAs: fix adherence, speed up 4-5x, enable visual tracking
• Build transparency in business rules and eliminate hidden rules
• Speed up enhancement cycles by an order of magnitude
• Reduce costs: infrastructure and operations
4. Use case highlights
• Weekly: ~0.5 TB input, ~3 TB output
• 50+ data sources (S3, FTP, internal DB, public cloud)
• 500+ business rules / KPIs
• Enterprise scale: 2,000+ end users
• Fully elastic cloud infrastructure
• Service-oriented architecture
• Full-blown DevOps pipelines
• Data to reports in <24 hours
• 100+ analytics data packs
5. High level solution architecture (diagram): DevOps pipeline with version control, code scans, and continuous integration; AWS Cloud; Databricks Cloud; Airflow; API Gateway; AWS Lambda; Redshift; Athena; S3; Notebook
6. Learning 1: Shortfall of combined domain and technical expertise
• SQL is a universal language for domain experts and is natively supported by Spark, making it ideal for them
• Build configurable frameworks that abstract the technical details and let domain experts focus on the business (a sketch follows below)
• We chose PySpark given the easy availability of talent and Airflow, and helped some domain experts switch to the more powerful PySpark APIs for complex pieces
• For enterprises, SQL and PySpark code is easier to manage and maintain, needing limited specialized skill sets, at least for reading
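A minimal sketch of the configurable-framework idea from this slide: business rules live as plain SQL in config, and a thin PySpark runner registers inputs and executes them. The rule list, view names, and S3 paths here are illustrative assumptions, not details from the talk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rule-runner").getOrCreate()

# Hypothetical: in practice the rules would come from a versioned config
# file that domain experts edit, not from code.
RULES = [
    {"name": "weekly_sales_kpi",
     "sql": "SELECT region, SUM(amount) AS sales FROM sales GROUP BY region"},
]

def run_rules(rules, inputs):
    """Register each input DataFrame as a temp view, then run every SQL rule."""
    for view_name, df in inputs.items():
        df.createOrReplaceTempView(view_name)
    return {rule["name"]: spark.sql(rule["sql"]) for rule in rules}

# Hypothetical S3 locations standing in for the platform's real layout.
sales = spark.read.parquet("s3://bucket/landing/sales/")
outputs = run_rules(RULES, {"sales": sales})
outputs["weekly_sales_kpi"].write.mode("overwrite") \
    .parquet("s3://bucket/kpis/weekly_sales/")
```

Domain experts only ever touch the SQL strings; the framework owns Spark sessions, I/O, and scheduling, which is what keeps the specialized skill-set requirement low.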
7. Learning 2: Extreme performance tips in Spark ecosystem
• Stick to S3 → memory → S3 and avoid sinking to HDFS; this ensures there are no roadblocks to segregating jobs across clusters, and memory is used appropriately
• Design for multiple contexts: it is attractive to use one context to build up a long series of jobs, but it is not resilient; job servers don't help much either
• The cost-based optimizer (in 2.2) is awesome! Analyze tables and turn it on
• Keep UDFs in Scala; PySpark UDFs are terrible in terms of performance
• Deep nesting of transformations is attractive but can fail; force actions periodically (we hit the SPARK-18492 bug); see the sketch below
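Two of these tips in code form, assuming Spark 2.2+; the table, column, and path names are illustrative. The cost-based optimizer needs statistics computed before it can pick better plans, and checkpointing is one way to force an action and truncate a deep logical plan.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("perf-tips")
         .config("spark.sql.cbo.enabled", "true")  # turn on the cost-based optimizer
         .getOrCreate())

# Collect table and column statistics so the CBO has something to work with.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS region, amount")

# Deeply nested transformation chains can trip planner bugs (we hit
# SPARK-18492); checkpointing materializes the data and cuts the lineage.
spark.sparkContext.setCheckpointDir("s3://bucket/checkpoints/")  # hypothetical path
df = spark.table("sales")
for step in range(50):  # stand-in for a long chain of business-rule transforms
    df = df.withColumn("step_{}".format(step), df["amount"] * 2)
    if step % 10 == 9:
        df = df.checkpoint()  # eager by default: forces an action here
```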
8. Learning 3: DevOps for Data Management Platforms
• Configs and input data change more frequently than code in data management catering to a large number of data sources
• DevOps pipelines designed for two different test groups (sketched below):
  – one to test code with static data and static configs
  – one to test integrated code, configs, and data with live data and live configs
• A game changer for data management platforms, especially ones that change frequently
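One way to realize the two test groups, sketched with pytest markers; the marker names, fixture paths, and S3 location are assumptions (and the markers would need registering in pytest.ini).

```python
import pytest
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master("local[2]")
         .appName("pipeline-tests").getOrCreate())

@pytest.mark.static_data
def test_rule_logic_on_frozen_fixture():
    """Group 1: code only, run against frozen data and frozen configs."""
    df = spark.read.parquet("tests/fixtures/sales_snapshot.parquet")
    result = df.groupBy("region").sum("amount")
    assert result.count() > 0

@pytest.mark.live_integration
def test_pipeline_with_live_configs_and_data():
    """Group 2: integrated code + live configs + live data."""
    df = spark.read.parquet("s3://bucket/landing/latest/")  # hypothetical
    assert df.count() > 0
```

The pipeline can then run `pytest -m static_data` on every commit and reserve `pytest -m live_integration` for the environment where live configs and data exist.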
9. Learning 4: Architecting for adaptability
• Service-oriented architecture with API wrappers works very well in a data management and analytics enablement context
• We created a number of services with AWS Lambda + API Gateway on top that were called by orchestration code in Airflow (see the sketch below)
• Be mindful of the 5 min limit with Lambda; work with ECS or Lambda + SQS if appropriate
• Even if you are not hosting APIs, design modules to support API-type inputs, for future extensibility and to force you to think about APIs from day 1
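A sketch of orchestration code in Airflow invoking a Lambda-backed service, using the Airflow 1.x APIs current at the time of the talk; the DAG id, function name, and payload are illustrative, not from the talk.

```python
import json
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def trigger_ingest_service(**context):
    """Call the (hypothetical) ingest service and fail the task on errors."""
    client = boto3.client("lambda")
    resp = client.invoke(
        FunctionName="ingest-service",       # hypothetical Lambda behind API Gateway
        InvocationType="RequestResponse",    # synchronous; mind the 5 min limit
        Payload=json.dumps({"run_date": context["ds"]}),
    )
    if resp.get("FunctionError"):
        raise RuntimeError(resp["Payload"].read().decode())

dag = DAG("data_platform", start_date=datetime(2017, 1, 1),
          schedule_interval="@weekly", catchup=False)

ingest = PythonOperator(task_id="trigger_ingest",
                        python_callable=trigger_ingest_service,
                        provide_context=True,  # Airflow 1.x style
                        dag=dag)
```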
10. Learning 5: Visual orchestration
• Business desired to visualize operational flows: "where is my pizza?"
• Developers desired to "code" instead of configuring visual tools
• Airflow strikes the balance; it is still in incubation, so expect bugs and issues, but it is very promising (minimal example below)
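A minimal DAG sketch (DAG id and task names are invented): dependencies declared in code render as a graph in the Airflow UI, which is what lets the business answer "where is my pizza?" while developers keep writing code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("weekly_refresh", start_date=datetime(2017, 1, 1),
          schedule_interval="@weekly", catchup=False)

# Each stage shows up as a node in the Airflow graph view.
land = DummyOperator(task_id="land_sources", dag=dag)
rules = DummyOperator(task_id="apply_business_rules", dag=dag)
publish = DummyOperator(task_id="publish_data_packs", dag=dag)
notify = DummyOperator(task_id="notify_users", dag=dag)

land >> rules >> publish >> notify  # the flow the business watches
```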
