
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spark Job-Server with Arvind Heda Kapil Malik


Kapil Malik and Arvind Heda will discuss a solution for interactive querying of large-scale structured data, stored in a distributed file system (HDFS / S3), in a scalable and reliable manner using a unique combination of Spark SQL, Apache Zeppelin and Spark Job-server (SJS) on YARN. The solution is production-tested and can cater to thousands of queries processing terabytes of data every day. It contains the following components:

1. Zeppelin server: A custom interpreter is deployed, which decouples the Spark context from the user notebooks. It connects to a remote Spark context on Spark Job-server. A rich set of APIs is exposed to the users. The user input is parsed, validated and executed remotely on SJS.
2. Spark Job-server: A custom application is deployed, which implements the set of APIs exposed by the Zeppelin custom interpreter as one or more Spark jobs.
3. Context router: It routes user queries from the custom interpreter to one of many Spark Job-servers / contexts.

The solution has the following characteristics:

* Multi-tenancy: There are hundreds of users, each having one or more Zeppelin notebooks. All these notebooks connect to the same set of Spark contexts for running a job.
* Fault tolerance: The notebooks do not use the Spark interpreter but a custom interpreter connecting to a remote context. If one Spark context fails, the context router sends user queries to another context.
* Load balancing: The context router identifies which contexts are under heavy load or responding slowly, and selects the most suitable context for serving a user query.
* Efficiency: We use Alluxio for caching common datasets.
* Elastic resource usage: We use Spark dynamic allocation for the contexts. This ensures that cluster resources are blocked by this application only when it is doing actual work.

Published in: Data & Analytics


  1. 1. Arvind Heda, Kapil Malik Indicium: Interactive Querying at Scale #EUeco9
  2. 2. What’s in the session … • Unified Data Platform on Spark – Single data source for all scheduled / ad hoc jobs and interactive lookup / queries – Data Pipeline – Compute Layer – Interactive Queries? • Indicium: Part 1 (managed context pool) • Indicium: Part 2 (smart query scheduler) 2#EUeco9
  3. 3. Unified Data Platform 3#EUeco9
  4. 4. Unified Data Platform…(for anything / everything) • Common Data Lake for storing – Transactional data – Behavioral data – Computed data • Drives all decisions / recommendations / reporting / analysis from same store. • Single data source for all Decision Edges, Algorithms, BI tools and Ad Hoc and interactive Query / Analysis tools • Data Platform needs to support – Scale – Store everything from summary to raw data. – Concurrency – Handle multiple requests in acceptable user response time. – Ad Hoc Drill down to any level – query, join, correlation on any dimension. 4#EUeco9
  5. 5. Unified Data Platform 5#EUeco9 [Architecture diagram: a Data Collection service feeds the Data Pipeline into HDFS / S3; Spark contexts on YARN form the Compute Layer serving scheduled jobs, compute jobs, BI scheduled reports, real-time lookup, and interactive queries via a Query UI]
  6. 6. Features 6#EUeco9 (Feature – Details – Approach)
     • Data Persistence – Store large volumes of transactional, behavioural and computed data – Spark, Parquet format on S3 / HDFS
     • Data Transformations – Transformations / aggregations, correlations and enrichments – Batch processing: Kafka / Java / Spark jobs
     • Algorithmic Access – Aggregated / raw data access for scheduled algorithms – Spark processes with SQL-context-based data access
     • Decision Making – Aggregated data access for real-time decisions – In-memory cache of aggregated data
     • Reporting BI / Ad Hoc Query – Aggregated / raw data access for scheduled BI reports and for ad hoc queries – BI tool with scheduled Spark SQL queries on the data store
     • Interactive Queries – Drill-down data access and ad hoc query / analysis for concurrent users – Scaling challenges for Spark SQL?
  7. 7. Data Pipeline • Kafka / Sqoop based data collection • Live lookup store for real time decisions • Tenant / Event and time based data partition • Time based compaction to optimize query on sparse data • Summary Profile data to reduce Joins • Shared compute resources but different context for Scheduled / Ad Hoc jobs or for Algorithmic / Human touchpoints 7#EUeco9
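The tenant / event / time partitioning on slide 7 can be illustrated with a small path builder. The exact layout used in the talk is not specified, so this Hive-style scheme and the field names are assumptions:

```python
from datetime import date

def partition_path(base, tenant, event_type, day):
    """Build a tenant/event/time partition path (layout is illustrative;
    the actual scheme used in the talk is not specified)."""
    return (f"{base}/tenant={tenant}/event={event_type}/"
            f"dt={day.isoformat()}")
```

Partition pruning on these directories is what keeps a query for one tenant's recent events from scanning the whole lake; the time-based compaction mentioned on the slide would merge small files within each `dt=` partition.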
  8. 8. Compute Layer • No real ‘real time’ queries – FIFO scheduling for user tasks • Static or rigid resource allocation between scheduled and ad hoc queries / jobs • Short-lived and stateless contexts – no stickiness for user-defined views like temp tables • Interactive queries? 8#EUeco9
  9. 9. What was needed for Interactive query… • SQL like Query Tool for Ad Hoc Analysis. • Scalability for concurrent users, – Fair Scheduling – Responsiveness • High Availability • Performance – specifically for scans and Joins • Extensibility – User Views / Datasets / UDF’s 9#EUeco9
  10. 10. Indicium ? 10#EUeco9
  11. 11. Indicium: Part 1 Managed Context Pool 11#EUeco9
  12. 12. Managed Context Pool 12#EUeco9 [Architecture diagram: Apache Zeppelin connects to Spark Job-server, which hosts SQL contexts on YARN backed by HDFS]
  13. 13. Managed Context Pool Apache Zeppelin 0.6 • SQL-like query tool and a notebook • Custom interpreter - Configuration: SJS server + context - Statement execution: Make asynchronous REST calls to SJS • Concurrency - Multiple interpreters and notebooks Spark Job-Server 0.6.x • Custom SQL context with catalog override • Custom application to execute queries • High Availability: Multiple SJS servers and multiple contexts per server 13#EUeco9
  14. 14. Managed Context Pool Features • Familiar SQL interface on notebooks • Concurrent multi-user support • Visualization Dashboards • Long running Spark Job – to support User Defined Views • Access control on Spark APIs • Custom SQL context with custom catalog – Intercept lookupTable calls to query actual data – Table wrappers for time windows - like select count(*) from `lastXDays(table)` 14#EUeco9
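The table-wrapper idea on slide 14 (e.g. `lastXDays(table)`) can be mimicked with a pre-processing rewrite before the statement reaches Spark SQL. In the talk this is done inside a custom catalog's `lookupTable`; the regex rewrite, column name `dt`, and fixed reference date below are illustrative stand-ins:

```python
import re

def expand_time_wrappers(sql, days_column="dt", today="2017-10-26"):
    """Illustrative version of the table-wrapper trick: rewrite
    `lastXDays(table)` references into a date-filtered subquery.
    Column name and date handling are hypothetical."""
    pattern = re.compile(r"`last(\d+)Days\((\w+)\)`")

    def repl(match):
        days, table = int(match.group(1)), match.group(2)
        return (f"(SELECT * FROM {table} WHERE {days_column} >= "
                f"date_sub('{today}', {days}))")

    return pattern.sub(repl, sql)
```

The benefit of doing this in the catalog rather than in user SQL is that the time window stays a configuration concern, not something every analyst has to get right in each query.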
  15. 15. Managed Context Pool Issues • Interpreter hard-wired to a context • FIFO scheduling: Single statement per interpreter-context pair – across notebooks / across users • No automated failure handling – Detecting a dead context / SJS server – Recovery from the context / server failure • No dynamic scheduling / load balancing – No way to identify an overloaded context • Incompatible with Spark 2.x 15#EUeco9
  16. 16. Indicium: Part 2 Smart Query Scheduler 16#EUeco9
  17. 17. Smart Query Scheduler 17#EUeco9 [Architecture diagram: Apache Zeppelin sends statements to the Smart Query Scheduler, which routes them to Spark Job-servers hosting SQL contexts on YARN backed by HDFS]
  18. 18. Smart Query Scheduler Zeppelin 0.7 • Supports per-notebook statement execution SJS 0.7 Custom Fork • Support for Spark 2.x Smart Query Scheduler: • Scheduling: API to dynamically bind SJS server + context for every job / query Other Optimizations: • Monitoring: Monitor jobs running per context • Availability: Track health of SJS servers and contexts and ensure healthy contexts in the pool 18#EUeco9
  19. 19. Smart Query Scheduler Dynamic scheduling for every query • Zeppelin interpreter agnostic of actual SJS / context • Load balancing of jobs per context • Query Classification and intelligent routing • Dynamic scaling / de-scaling the pool size • Shared Cache • User Defined Views • Workspaces or custom time window view for every interpreter 19#EUeco9
  20. 20. Query Classification / routing Custom resource configurations for contexts dedicated to complex or asynchronous queries / jobs: • Classify queries based on heuristics / historic data into light / heavy queries and route them to different contexts. • Separate contexts for interactive vs background queries – An export-table call does not starve an interactive SQL query 20#EUeco9
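A toy version of the classification heuristic on slide 20 is shown below. The actual classifier uses historic run data; the keyword rules, thresholds and table names here are made up for illustration:

```python
def classify_query(sql, heavy_tables=frozenset({"raw_events"})):
    """Toy heuristic in the spirit of the slide: send exports to a
    background pool, JOINs and scans of known-large tables to a
    'heavy' context, everything else to a 'light' context."""
    s = sql.lower()
    if "export" in s or "insert overwrite" in s:
        return "background"
    if " join " in s or any(t in s for t in heavy_tables):
        return "heavy"
    return "light"
```

Even a crude split like this prevents the starvation case called out on the slide: a long export job lands in the background pool and never queues ahead of an interactive statement.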
  21. 21. Spark Dynamic Context Elastic scaling of contexts, co-existing on same cluster as scheduled batch jobs • Scale up in day time, when user load is high • Scale down in night, when overnight batch jobs are running • Scaling also helped to create reserved bandwidth for any set of users, if needed. 21#EUeco9
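The elastic scaling on slide 21 builds on Spark's standard dynamic-allocation settings; a typical configuration looks like the fragment below (property names are standard Spark configuration; the values are illustrative, not the ones used in the talk):

```properties
# Let Spark grow and shrink the executor count with load
spark.dynamicAllocation.enabled=true
# External shuffle service is required so executors can be released safely
spark.shuffle.service.enabled=true
# Keep a small floor for interactive responsiveness, cap the daytime peak
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=50
# Release idle executors quickly at night so batch jobs get the cluster
spark.dynamicAllocation.executorIdleTimeout=60s
```

With a low `minExecutors`, an idle interactive context holds almost nothing overnight, which is what lets it share the cluster with scheduled batch jobs.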
  22. 22. Shared Cache Alluxio to store common datasets • Single cache for common datasets across contexts – Avoids replication across contexts – Cached data safe from executor / context crashes • Dedicated refresh thread to release / update data consistently across contexts 22#EUeco9
  23. 23. Persistent User Defined Views • Users can define a temp view for a SQL query • Replicated across all SJS servers + contexts • Definitions persisted in DB so that a context restart is accompanied by temp views’ registration. • Load on start to warm up load of views • TTL support for expiry 23#EUeco9
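The persistence and TTL behaviour on slide 23 can be sketched as a small registry. A dict stands in for the backing DB, and the injectable clock is only there to make the sketch testable; all names are illustrative:

```python
import time

class ViewRegistry:
    """Sketch of persistent user-defined views: definitions live in a
    store (a dict stands in for the DB), are replayed on context
    restart, and expire after a TTL."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}   # view name -> (sql, registered_at)

    def register(self, name, sql):
        self.store[name] = (sql, self.clock())

    def live_views(self):
        # On context restart, re-register only unexpired definitions.
        now = self.clock()
        return {name: sql for name, (sql, t) in self.store.items()
                if now - t < self.ttl}
```

On a warm-up pass after a context restart, each entry of `live_views()` would be replayed as a `CREATE TEMP VIEW` against the fresh SQL context.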
  24. 24. Workspaces • Support for multiple custom catalogs in SQL context for table resolution • Custom time range / source / caching – Global – Per catalog – Per table • Configurable via Zeppelin interpreter • Decoupled time range from query syntax – Join a behaviour table (last 30 days) with a lookup table (complete data) 24#EUeco9
  25. 25. Automated Pool Management • Monitoring scripts to track and restart unhealthy / unresponsive SJS servers / contexts • APIs on SJS to stop / start / refresh context / SJS • APIs to refresh cached tables / views • APIs on Router Service to reconfigure routing / pool size and resource allocation 25#EUeco9
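The monitoring loop on slide 25 reduces to "probe each context, restart the dead ones". In the sketch below, `is_alive` and `restart` are hypothetical stand-ins for the SJS health-check and restart REST APIs mentioned on the slide:

```python
def reap_unhealthy(contexts, is_alive, restart):
    """Sketch of the pool-management script: probe each SJS context
    and restart any that fail the health check. Returns the list of
    restarted contexts so the run can be logged."""
    restarted = []
    for ctx in contexts:
        if not is_alive(ctx):
            restart(ctx)
            restarted.append(ctx)
    return restarted
```

Run periodically (e.g. from cron), this keeps the pool healthy without operator intervention; the router simply stops receiving traffic for a context while it is being restarted.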
  26. 26. Thank You ! 26#EUeco9 Questions & Answers
  27. 27. References • Apache Zeppelin • Spark Job-server • Alluxio 27#EUeco9
  28. 28. Scale …. • Data – ~ 100 TB – ~ 1000 Event Types • 100+ Active concurrent users • 30+ Automated Agents • 10000+ Scheduled / 3000+ Ad Hoc Analysis • Avg data churn per Analysis > 200 GB 28#EUeco9