
Building A Self Service Analytics Platform on Hadoop

These slides were presented by Avinash Ramineni of Clairvoyant to the Atlanta Apache Spark User Group on Wednesday, March 22, 2017: https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/238109721/



  1. Building a Self Service Analytics Platform on Hadoop (Avinash Ramineni)
  2. Clairvoyant
  3. Clairvoyant Services
  4. Quick Poll
     • Big Data Deployments in Prod
     • Hadoop Distributions
     • People use Ecosystems rather than tools
     • Architecture was implemented on Cloudera
     • Cloud Experience – AWS?
  5. Challenges
     • Data in Silos
     • Data acquires perspectives as it is moved
     • Data availability delays
     • Legacy Systems handling the Volume, Veracity, and Velocity
     • Extracting data from legacy systems
     • Lack of Self-Service Capabilities
     • Knowledge becomes tribal instead of institutional
     • Security / Compliance Requirements
  6. Data Lake Attributes
     • Data Democratization
     • Data Discovery
     • Data Lineage
     • Self-Service Capabilities
     • Metadata Management
  7. Without Self-Service
  8. Self-Service at All Levels
     (Pipeline diagram: Ingest → Organize → Enrich → Analyze → Dashboards / Insights)
  9. Key Design Tenets
     • Separation of Compute and Storage
       • Independently scale compute and storage
     • Data Democratization and Governance
     • Bring Your Own Compute (BYOC)
     • HA / DR
     • Open Source Stack
  10. Separation of Compute and Storage
      • Scale storage and compute independently
      • Shifts bottleneck from disk I/O to network
      • Centralized Data Storage
      • Data Democratization
      • No data duplication
      • Easier hardware upgrade paths
      • Flexible Architecture
      • DR Simplified
  11. BYOC (Bring Your Own Cluster)
      • Each department/application can bring its own Hadoop cluster
      • Eliminates the need for very large clusters
      • Easier to administer and maintain
      • Reduces multi-tenancy issues
      • Clusters can be upgraded independently
      • Enables a usage-based cost model
      (Diagram: Marketing, Personalization, and Main clusters all attached to centralized / common S3 storage)
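The BYOC model hinges on every per-team cluster mounting the same object store, so no dataset is copied per cluster. A minimal, hypothetical `spark-defaults.conf` fragment illustrating the idea (the bucket name and credentials provider are assumptions, not from the slides):

```properties
# Per-team BYOC cluster: all clusters point at the same shared S3 bucket.
spark.sql.warehouse.dir                        s3a://shared-data-lake/warehouse
# Use the EC2 instance profile for S3 credentials instead of embedded keys.
spark.hadoop.fs.s3a.aws.credentials.provider   com.amazonaws.auth.InstanceProfileCredentialsProvider
```

Because storage lives outside any one cluster, a department cluster can be resized, upgraded, or even torn down without touching the data.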
  12. Architecture
  13. Architecture – Data Ingestion Layer
      • DB Ingestor
      • Stream Ingestor
        • Kafka and Spark Streaming
      • File Ingestor
        • FTP / SFTP / Logs
      • Ingestion using Service API
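The file-ingestor path (FTP / SFTP / logs) can be sketched as a small routine that moves dropped files into a dated landing layout so downstream jobs can find them. This is an illustrative stdlib-only sketch, not the deck's actual ingestor; all names and the layout are assumptions:

```python
import shutil
from datetime import date
from pathlib import Path

def ingest_file(src: Path, landing_root: Path, source_name: str) -> Path:
    """Move one dropped file (e.g. arrived via FTP/SFTP) into the landing
    area, laid out by source system and arrival date."""
    dest_dir = landing_root / source_name / f"dt={date.today().isoformat()}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.move(str(src), str(dest))  # atomic on the same filesystem
    return dest
```

A DB or stream ingestor would land data into the same layout, keeping one convention for everything entering the platform.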
  14. Architecture – Data Processing Layer
      • Storage layer carved into logical buckets
        • Landing, Raw, Derived, and Delivery
      • Schema stored with data (no guesswork)
      • Platform Jobs
        • Converting text to Parquet
        • Saving streaming data as Parquet
        • Derivatives
        • Compaction
        • Standardization
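The "schema stored with data" idea can be sketched as writing a schema side file next to every table directory in the logical buckets, so readers never guess column types. A stdlib-only sketch under assumed names (the deck's real implementation presumably uses Parquet's embedded schema):

```python
import json
from pathlib import Path

# Logical buckets named on the slide.
LAYERS = ("landing", "raw", "derived", "delivery")

def write_with_schema(root: Path, layer: str, table: str,
                      rows: list, schema: dict) -> Path:
    """Write rows as JSON lines plus a _schema.json side file."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    table_dir = root / layer / table
    table_dir.mkdir(parents=True, exist_ok=True)
    (table_dir / "part-00000.json").write_text(
        "\n".join(json.dumps(r) for r in rows))
    (table_dir / "_schema.json").write_text(json.dumps(schema, indent=2))
    return table_dir
```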
  15. Architecture – Data Delivery Layer
      • Data Delivery
        • SQL – Spark Thrift Server / Impala
        • Tableau, SQL IDE, Applications
      • Self Service
      • Derivatives
        • Represented via SQL on the Delivery layer
        • Stored in the Derived storage layer
        • Metadata driven
      • Derived Layer Generators
        • Long-running Spark job
        • Derivative refresh
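"Metadata driven" derivatives can be sketched as a generator that renders refresh SQL from a metadata record, which a long-running job would then submit to the Thrift Server. The record shape and function are hypothetical illustrations, not the deck's schema:

```python
def derivative_sql(meta: dict) -> str:
    """Render the refresh SQL for one metadata-driven derivative.

    meta keys (assumed): "name", "source", "columns",
    and an optional "predicate" filter.
    """
    cols = ", ".join(meta["columns"])
    sql = (f"CREATE OR REPLACE VIEW {meta['name']} AS "
           f"SELECT {cols} FROM {meta['source']}")
    if meta.get("predicate"):
        sql += f" WHERE {meta['predicate']}"
    return sql
```

Because the derivative is just SQL plus metadata, analysts can add one without writing a Spark job, which is the self-service point of the slide.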
  16. Key Takeaways - Cloud
      • Hadoop cloud readiness
      • Cloudera Director limitations
        • Multi-availability-zone and multi-region support
      • Storage
        • Instance storage
        • EBS volumes: gp2 vs st1
      • S3 eventual consistency
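At the time of this 2017 talk, S3 listings were only eventually consistent (S3 did not gain strong read-after-write consistency until late 2020), so jobs sometimes had to poll before trusting a LIST after a PUT. A sketch of that defensive pattern, with the listing function injected purely for illustration:

```python
import time

def wait_until_visible(list_keys, key: str,
                       attempts: int = 5, delay: float = 0.01) -> bool:
    """Poll a listing function until `key` appears, to paper over
    eventually consistent LIST-after-PUT behavior."""
    for _ in range(attempts):
        if key in list_keys():
            return True
        time.sleep(delay)
    return False
```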
  17. Key Takeaways - Spark Thrift Server
      • Spark Thrift Server support
      • Performance tuning
        • Concurrency
        • Partition strategy
        • Cache tables
      • Compression codec for Parquet
        • Snappy vs gzip
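The tuning bullets above map onto a handful of Spark settings. A hypothetical `spark-defaults.conf` fragment for the Thrift Server (the values are illustrative starting points, not the deck's production numbers):

```properties
# Fair scheduling so one heavy BI query cannot starve concurrent users.
spark.scheduler.mode                 FAIR
# Snappy decompresses faster than gzip at the cost of larger files.
spark.sql.parquet.compression.codec  snappy
# Tune shuffle parallelism to data volume and cluster size.
spark.sql.shuffle.partitions         200
```

Hot dimension tables can additionally be pinned in memory from SQL with `CACHE TABLE <name>`.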
  18. Key Takeaways - Security
      • Secure by Design, Secure by Default
      • Access to data on S3
        • IAM Roles
      • Sentry
        • Support for Spark
      • Kerberos
        • Spark Thrift Server
      • Navigator
        • Support for Spark
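Controlling S3 access with IAM roles typically means scoping each cluster's role to its own prefix of the shared bucket. A hypothetical policy fragment for a marketing cluster (bucket and prefix names are assumptions, not from the slides):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::shared-data-lake",
      "arn:aws:s3:::shared-data-lake/marketing/*"
    ]
  }]
}
```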
  19. Key Takeaways - General
      • Rapidly changing technology
        • Feature addition
        • Documentation
        • Bugs
      • Jar hell
      • Small files
        • Performance issues
        • Compaction
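Small files hurt because each one costs a task and a round of metadata calls; compaction rewrites many small files as a few large ones. The planning half of that job can be sketched as simple size-based batching (a stdlib-only illustration; a real compactor would then rewrite each batch as one Parquet file):

```python
def plan_compaction(file_sizes: list, target: int) -> list:
    """Group file sizes into batches of at most `target` bytes each;
    each batch would be rewritten as one larger file.
    A single file bigger than `target` becomes its own batch."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches
```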
  20. Key Takeaways - General
      • Partition strategy
        • Parquet files: balancing parallelism and throughput
        • Table partitions
      • Cluster sizing, optimization, and tuning
      • Integrating with corporate infrastructure
        • Deployment practices
        • Monitoring and alerting
        • Information security policies
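A common partition strategy for tables like these is Hive-style daily partitions: coarse enough to avoid the small-files problem, fine enough that queries prune to only the days they need. A sketch of the path convention (the layout and names are illustrative assumptions):

```python
from datetime import datetime

def partition_path(base: str, table: str, event_time: datetime) -> str:
    """Hive-style daily partition path, e.g. .../events/dt=2017-03-22.
    Engines that understand the dt=... convention can skip whole days."""
    return f"{base}/{table}/dt={event_time:%Y-%m-%d}"
```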
  21. Data Security
  22. Questions
      • Principal @ Clairvoyant
      • Email: avinash@clairvoyantsoft.com
      • LinkedIn: https://www.linkedin.com/in/avinashramineni
