Building A Self Service Analytics Platform on Hadoop

•

1 like•803 views

These slides were presented by Avinash Ramineni of Clairvoyant to the Atlanta Apache Spark User Group on Wednesday, March 22, 2017: https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/238109721/

Technology

1Page
Building a Self Service Analytics
Platform on Hadoop
Avinash Ramineni

4Page
Quick Poll
• Big Data Deployments in Prod
• Hadoop Distributions
• People use Ecosystems rather than tools
• Architecture was implemented on Cloudera
• Cloud Experience – AWS ?

5Page
Challenges
• Data in Silos
• Acquires Perspectives as data is moved
• Data availability delays
• Legacy Systems handling the Volume , Veracity and Velocity
• Extracting data from legacy systems
• Lack of Self-Service Capabilities
• Knowledge becomes tribal – instead of institutional
• Security / Compliance Requirements

6Page
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management

8Page
Self-Service at all Levels
Ingest Organize Enrich Analyze Dashboards
AnalyzeIngest Organize Enrich Insights

9Page
Key Design Tenets
• Separation of Compute and Storage
• Independently scale compute and storage
• Data Democratization and Governance
• Bring your own Compute (BYOC)
• HA / DR
• Open Source Stack

1
0
Page
Separation of Compute and Storage
• Scale storage and compute independently
• Shifts bottleneck from Disk IO to Network
• Centralized Data Storage
• Data Democratization
• No data duplication
• Easier Hardware upgrade paths
• Flexible Architecture
• DR Simplified

1
1
Page
BYOC (Bring Your Own Cluster)
• Each department/application can bring its own Hadoop cluster
• Eliminates the need for very large clusters
• Easier to administer and maintain
• Reduces multi-tenancy issues
• Clusters can be upgraded independently
• Enables usage based cost model
Centralized / Common S3 Storage
Marketing
Cluster
Centralized
Storage
Personalization
Cluster
Main
Cluster

1
3
Page
Architecture – Data Ingestion Layer
• DB Ingestor
• Stream Ingestor
• Kafka and Spark Streaming
• File Ingestor
• FTP / SFTP / Logs
• Ingestion using Service API

1
4
Page
Architecture – Data Processing Layer
• Storage layer carved into logical buckets
• Landing, Raw, Derived and Delivery
• Schema stored with data (no guesswork)
• Platform Jobs
• Converting text to Parquet
• Saving streaming data Parquet
• Derivatives
• Compaction
• Standardization

1
5
Page
Architecture – Data Delivery Layer
• Data Delivery
• SQL - Spark Thrift Server / Impala
• Tableau, SQL IDE, Applications
• Self Service
• Derivatives
• Represented Via SQL on Delivery Layer
• Stored in Derived Storage Layer
• Metadata driven
• Derived Layer Generators
• Long running Spark Job
• Derivative Refresh

1
6
Page
Key Takeaways - Cloud
• Hadoop Cloud ready-ness
• Cloudera Director Limitations
• Multi-Availability zone, regions
• Storage
• Instance Storage
• EBS Volumes
• gp2 vs st1
• S3 Eventual Consistency

1
7
Page
Key Takeaways - Spark Thrift Server
• Spark Thrift Server Support
• Performance Tuning
• Concurrency
• partition strategy
• Cache Tables
• Compression Codec for Parquet
• Snappy vs gzip

1
8
Page
Key Takeaways - Security
• Secure by Design, Secure by Default
• Access to Data on S3
• IAM Roles
• Sentry
• Support for Spark
• Kerberos
• Spark Thrift Server
• Navigator
• Support for Spark

1
9
Page
Key Takeaways - General
• Rapidly Changing Technology
• Feature addition
• Documentation
• Bugs
• Jar hell
• Small files
• Performance Issues
• Compaction

2
0
Page
Key Takeaways - General
• Partition Strategy
• Parquet Files
• Balancing parallelism and throughput
• Table Partitions
• Cluster sizing, optimization and tuning
• Integrating with Corporate infrastructure
• Deployment practices
• Monitoring and Alerting
• Information Security Policies

2
2
Page
Questions
• Principal @ Clairvoyant
• Email: avinash@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/avinashramineni

This document provides an overview of Anzo Unstructured, a natural language processing (NLP) platform from Cambridge Semantics. It discusses the core capabilities of Anzo Unstructured, including intake of various file formats, extraction of entities and relationships, and semantic analysis. It also outlines example use cases in pharma and finance. The document demonstrates the configuration and visualization of Anzo Unstructured pipelines and annotations.

How to Build a Smart Data Lake Using Semantics

Cambridge Semantics

This document provides an overview and introduction to Cambridge Semantics Inc. and their Anzo Smart Data Platform for building smart data lakes using semantics. Key points include: - Cambridge Semantics was founded in 2007 and their Anzo software suite uses open semantic web standards to create data analytics and management solutions from diverse data sources. - While data lakes make it easy to assemble large volumes of data, identifying and linking data across sources remains challenging without harmonization of meanings. Semantic models and tools can help address these issues. - The Anzo Analytics and Data Integration Suite uses business understandable semantic models to describe, search, query and analyze data from various structured and unstructured sources to build a smart data lake.

Transforming Data Management and Time to Insight with Anzo Smart Data Lake®

Cambridge Semantics

The document discusses how Anzo Smart Data Lake can help government agencies transform data management and increase time to insight. It provides an overview of Anzo and how it uses semantic knowledge graphs to link and harmonize diverse data sources for self-service data preparation, discovery, and analytics. Examples are given of how Anzo has helped organizations in intelligence and defense integrate data sources and gain better visibility into areas like contract performance. The presentation concludes by discussing how Anzo could help agencies drive business efficiency, enable more self-service for citizens using public data, and suggests next steps of proof of concept or proposal.

Splunk Business Analytics

CleverDATA

The document discusses how traditional analytics approaches are no longer sufficient due to new data sources like machine data that are unstructured and from external sources. It introduces Splunk as a platform that can collect, index, and analyze massive amounts of machine data in real-time to provide operational intelligence and business insights. Splunk uses late binding schema to allow ad-hoc queries over heterogeneous machine data without needing to design schemas upfront. It can complement traditional BI tools by focusing on real-time analytics over machine data while traditional tools focus on structured data.

Using Cloud Automation Technologies to Deliver an Enterprise Data Fabric

Cambridge Semantics

The Year of the Graph

Cambridge Semantics

Graph technology has truly burst onto the scene with diverse new products and services, proving that graph is relevant and that not all graph use cases are equal. Previously relegated to niche implementations and science projects, graph now finds itself deployed as the foundational technology for enterprise analytics solutions and enterprise Data Fabric strategies. It is no surprise that many are calling 2018 “The Year of the Graph”.

How to build a successful Data Lake

DataWorks Summit/Hadoop Summit

This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.

When it comes to creating an enterprise AI strategy: if your company isn’t good at analytics, it’s not ready for AI. Succeeding in AI requires being good at data engineering AND analytics. Unfortunately, management teams often assume they can leapfrog best practices for basic data analytics by directly adopting advanced technologies such as ML/AI – setting themselves up for failure from the get-go. This presentation explains how to get basic data engineering and the right technology in place to create and maintain data pipelines so that you can solve problems with AI successfully.

Modern Data Discovery and Integration in Retail Banking

Cambridge Semantics

Retail banks are moving beyond the data warehouse and data lake and are now implementing data fabric architectures to address data discovery and integration challenges. These are the slides from our webinar "Modern Data Discovery and Integration in Retail Banking" in which we explore the role of the data discovery and integration layer in a data fabric with special focus on evolution from data warehouse to data fabric, semantics and graph data models in data fabric and example use cases in retail banks and B2C financial services.

Necessity of Data Lakes in the Financial Services Sector

DataWorks Summit

With the emergence of regulations such as the General Data Protection Regulation from the European Union (effective May 2018), with fines up to 20m Euro, Data Lakes are emerging as the data architecture of choice amongst financial institutions. Banks are embarking on a journey to enable data scientists to unlock the value of the data silo'ed in many disparate data systems. By enabling self service data access and merging multiple streams of data by using data clustering, entity extraction, identity resolution and other techniques - we will show how banks have used Analytics to uncover business value without falling into the abyss of data swamps. The build out of the data lake requires the ingestion of data from multiple operational systems . By leveraging an automated Data Cataloging service, organizations are able to search, profile, discover, tag, track lineage and capture tribal knowledge delivered on the FICO Analytics Cloud enabling the data scientists to build innovative models, make automated decisions, track fraudulent usage, make intelligent marketing campaigns and improve the top line and bottom line for the financial institution. Speaker: Rohit Valia, Product Management and Strategy, Fico

Accelerate Digital Transformation with an Enterprise Big Data Fabric

Cambridge Semantics

In this webinar by Cambridge Semantics' VP of Solution Engineering, Ben Szekely, you will learn more about how the Enterprise Data Fabric prevails as the bedrock of enterprise digital strategy. Connected and highly available data is the new normal - powering analytics and AI. The data lake itself is commoditized, like raw compute or disk, and becomes an unseen part of the stack. Semantic graph technology is central to Data Fabric initiatives that meaningfully contribute to digital transformation. We share our vision for digital innovation - a shift to something powerful, expedient and future-proof. The Data Fabric connects enterprise data for unprecedented access in an overlay fashion that does not disrupt current investments. Interconnected and reliable data drives business outcomes by automating scalable AI and ML efforts. Graph technology is the way forward to realize this future.

The Convergence of Data & Digital: Mapping Out a Cohesive Strategy for Maximu...

Remy Rosenbaum

Slides from Joe Caserta's Keynote at MIT CDOIQ Symposium 2018 As we continue to shift into a data-driven digital society, it’s crucial to ensure a cohesive strategy between the chief data officer and chief digital officer. In this talk, Joe Caserta will discuss the convergence between data and digital, addressing the interdependencies, ambiguities, and complications between the two. Joe will outline a cohesive strategy to enhance enterprise operations and improve your bottom line.

Solution architecture for big data projects

Sandeep Sharma IIMK Smart City,IoT,Bigdata,Cloud,BI,DW

Sustainability Investment Research Using Cognitive Analytics

Cambridge Semantics

Data Integration Alternatives: When to use Data Virtualization, ETL, and ESB

Denodo

Data integration is paramount, in this presentation you will find three different paradigms: using client-side tools, creating traditional data warehouses and the data virtualization solution - the logical data warehouse, comparing each other and positioning data virtualization as an integral part of any future-proof IT infrastructure. This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/1q94Ka.

Data Mining and Data Warehousing

Amdocs

Data mining and data warehousing have evolved since the 1960s due to increases in data collection and storage. Data mining automates the extraction of patterns and knowledge from large databases. It uses predictive and descriptive models like classification, clustering, and association rule mining. The data mining process involves problem definition, data preparation, model building, evaluation, and deployment. Data warehouses integrate data from multiple sources for analysis and decision making. They are large, subject-oriented databases designed for querying and analysis rather than transactions. Data warehousing addresses the need to consolidate organizational data spread across various locations and systems.

Data Services Marketplace

Denodo

The data services marketplace is enabled by a data abstraction layer that supports rapid development of operational applications and single data view portals. In this presentation yo will learn services-based reference architecture, modality, and latency of data access. - Reference architecture for enterprise data services marketplace - Modality and latency of data access - Customer use cases and demo This presentation is part of the Denodo Educational Seminar , and you can watch the video here goo.gl/vycYmZ.

Why Data Virtualization? An Introduction by Denodo

Justo Hidalgo

Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning

Cambridge Semantics

Big Data and Data Virtualization

Kenneth Peeples

Red Hat's document discusses using JBoss Data Virtualization to gain better insights from big data. It describes challenges with existing data integration approaches as data sources grow in size, type and location. Red Hat's big data strategy is to reduce the information gap by making all data easily consumable for analytics. JBoss Data Virtualization software virtually unifies data across sources and exposes it to applications through standard interfaces. The demonstration shows integrating social media sentiment data from Hadoop with sales data from MySQL to analyze movie ticket and merchandise sales.

Cortana Analytics Workshop: Azure Data Catalog

MSAdvAnalytics

Julie Strauss. This session introduces the newest services in the Cortana Analytics family. The Azure Data Catalog is an enterprise-wide metadata catalog that enables self-service data source discovery. Data Catalog is a fully managed service that stores, describes, indexes, and provides information on how to access any registered data source in your organization. This session presents an overview of the Data Catalog and how – by using it to register, enrich, discover, understand and consume data sources – you can close the gap between those seeking information and those creating it.

ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML Meetup

Romain Yon

Original event: https://www.meetup.com/NYC-Machine-Learning/events/256605862/ -- "Doing large scale ML in production is hard" – Everyone who's tried This talk is focussed on ML Systems. Especially the less obvious pitfalls, which have caused us troubles at Spotify. This talk assumes a certain level of familiarity with ML: You'll get the most out of if you've some experience with applied ML, ideally on production systems. Romain Yon is a Staff ML Engineer at Spotify. Over the years, Romain has worked on many of the core ML systems that power Spotify today (Music Recommendation, Catalog Quality, Search Ranking, Ads, ..). During the past year, Romain has been mostly focusing on designing reusable ML Infrastructure that can be leveraged throughout Spotify. Prior to Spotify, Romain co-founded the startup https://linkurio.us while getting his MSc in ML from Georgia Tech.

Graph-driven Data Integration: Accelerating and Automating Data Delivery for ...

Cambridge Semantics

In our webinar "A Data Fabric Market Update with Guest Speaker, VP, Principal Analyst Noel Yuhanna" Ben Szekely, Cambridge Semantics’ Co-founder and SVP of Field Operations, and guest speaker, Noel Yuhanna, VP and Principal Analyst at Forrester and author of the “The Forrester Wave™: Enterprise Data Fabric, Q2 2020”, discuss the state of the Data Fabric Market. These are Ben's slides from that webinar.

Supporting Data Services Marketplace using Data Virtualization

Denodo

The document discusses an Enterprise Data Marketplace that would serve as a centralized repository for reusable data assets. It would allow all internal and external data sources to be unified and accessed through a single portal. This marketplace would standardize data access, reduce redundant data retrieval, and provide benefits like governance of data services and an abstraction layer to reduce direct access to source systems. Screenshots are provided of the marketplace's potential capabilities like searching for data assets, a data dictionary, and shopping cart functionality.

Denodo Data Virtualization - IT Days in Luxembourg with Oktopus

Denodo

1) Denodo provides a data virtualization platform that connects disparate data sources and allows users to access and analyze enterprise data without moving or replicating it. 2) Customers like Bank of the West, Intel, and Asurion saw improvements like faster time to market, increased agility, and cost savings by using Denodo to replace ETL processes and create a single access layer for all their data. 3) Denodo's platform provides capabilities for data abstraction, zero replication, performance optimization, data governance, and deployment in multiple locations.

AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...

Cambridge Semantics

Big data it’s impact on the finance function

Mike Davis

Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...

Data Con LA

The document discusses how an Enterprise Data Lake (EDL) provides a more effective solution for enterprise BI and analytics compared to traditional enterprise data warehouses (EDW). It argues that EDL allows enterprises to retain all datasets, service ad-hoc requests with no latency or development time, and offer a low-cost, low-maintenance solution that supports direct analytics and reporting on data stored in its native format. The document promotes EDL as a mainstream solution that should be part of every mid-sized and large enterprise's standard IT stack.

What's hot

Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric

Cambridge Semantics

Creating an Enterprise AI Strategy

AtScale

Modern Data Discovery and Integration in Retail Banking

Cambridge Semantics

Necessity of Data Lakes in the Financial Services Sector

DataWorks Summit

Accelerate Digital Transformation with an Enterprise Big Data Fabric

Cambridge Semantics

The Convergence of Data & Digital: Mapping Out a Cohesive Strategy for Maximu...

Remy Rosenbaum

Solution architecture for big data projects

Sandeep Sharma IIMK Smart City,IoT,Bigdata,Cloud,BI,DW

Sustainability Investment Research Using Cognitive Analytics

Cambridge Semantics

Data Integration Alternatives: When to use Data Virtualization, ETL, and ESB

Denodo

Data Mining and Data Warehousing

Amdocs

Data Services Marketplace

Denodo

Why Data Virtualization? An Introduction by Denodo

Justo Hidalgo

Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning

Cambridge Semantics

Big Data and Data Virtualization

Kenneth Peeples

Cortana Analytics Workshop: Azure Data Catalog

MSAdvAnalytics

ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML Meetup

Romain Yon

Graph-driven Data Integration: Accelerating and Automating Data Delivery for ...

Cambridge Semantics

Supporting Data Services Marketplace using Data Virtualization

Denodo

Denodo Data Virtualization - IT Days in Luxembourg with Oktopus

Denodo

AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...

Cambridge Semantics

What's hot (20)

Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric

Creating an Enterprise AI Strategy

Modern Data Discovery and Integration in Retail Banking

Necessity of Data Lakes in the Financial Services Sector

Accelerate Digital Transformation with an Enterprise Big Data Fabric

The Convergence of Data & Digital: Mapping Out a Cohesive Strategy for Maximu...

Solution architecture for big data projects

Sustainability Investment Research Using Cognitive Analytics

Data Integration Alternatives: When to use Data Virtualization, ETL, and ESB

Data Mining and Data Warehousing

Data Services Marketplace

Why Data Virtualization? An Introduction by Denodo

Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning

Big Data and Data Virtualization

Cortana Analytics Workshop: Azure Data Catalog

ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML Meetup

Graph-driven Data Integration: Accelerating and Automating Data Delivery for ...

Supporting Data Services Marketplace using Data Virtualization

Denodo Data Virtualization - IT Days in Luxembourg with Oktopus

AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...

Viewers also liked

Big data it’s impact on the finance function

Mike Davis

Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...

Data Con LA

Building enterprise advance analytics platform

Haoran Du

Raymond Fu gave a presentation on building an enterprise analytics platform at the SoCal Data Science Conference. He has over 16 years of experience in big data, business intelligence, and enterprise architecture. He discussed how big data disrupts traditional architecture and requires new skills. Advanced analytics involves creating predictive models through machine learning to enable strategic and operational decisions. An enterprise analytics strategy involves data management, modernizing data platforms, and operationalizing advanced analytics models. Fu outlined the key capabilities needed for data management, analytics creation, and analytics operationalization. He provided examples of reference architectures and services that can be used to build an enterprise analytics platform.

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals

Cloudera, Inc.

The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices including the identification of dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices actually can be implemented in Hadoop.

Data Warehouse Design and Best Practices

Ivo Andreev

A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to scalable, balanced and flexible architecture that is capable to meet both present and long-term future needs. This session covers a comparison of the main data warehouse architectures together with best practices for the logical and physical design that support staging, load and querying.

Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...

NICSA

With the proliferation of Big Data-oriented technology and its accompanying applications of advanced statistical techniques, asset managers are enabling their sales and marketing teams with more insight into the preferences and proclivities of their clients, both advisors and investors. This webinar will give attendees a general understanding of Big Data’s technologies and techniques especially as they pertain to using predictive analytics for more effective and targeted marketing and distribution. Desired Outcomes: Understanding Big Data and how it is enabling adopters to use data more effectively than in the past Familiarity with some of the technological and analytical approaches Big Data enables Understanding of attribution models for measuring advisor and investor responsiveness Knowledge of how to prioritize campaigns and contacts by combining measures of valuation and responsiveness Grasp of some of the more effective way to adopt predictive analysis for sales and marketing Understanding basics of recommender systems and how next best action is determined

Big data architectures and the data lake

James Serra

The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include: - Defining top-down and bottom-up approaches to data management - Explaining what a data lake is and how Hadoop can function as the data lake - Describing how a modern data warehouse combines features of a traditional data warehouse and data lake - Discussing how federated querying allows data to be accessed across multiple sources - Highlighting benefits of implementing big data solutions in the cloud - Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (

Viewers also liked (7)

Big data it’s impact on the finance function

Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...

Building enterprise advance analytics platform

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals

Data Warehouse Design and Best Practices

Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...

Big data architectures and the data lake

Similar to Building A Self Service Analytics Platform on Hadoop

Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...

Avinash Ramineni

Enterprises have been rapidly adopting data lakes as a complement or replacement of data warehouses. Many of the Data lake implementations are ignoring the inherent drawbacks and limitations of Data Lakes and ending up as data swamps with little or no benefit to the businesses. In this session we will go through some of challenges and the key aspects that need to be considered for successful Data lake implementations.

Presentation Presentation Presentation Presentation Presentation

bangel105

Managing storage on Prem and in Cloud

Howard Marks

This document discusses managing storage across public and private resources. It covers the evolution of on-site storage management, storage options in the public cloud, and challenges of managing hybrid cloud storage. Key topics include the transition from siloed storage to software-defined storage, various cloud storage services like object storage and block storage, challenges of public cloud limitations, and solutions for connecting on-site and cloud storage like gateways, file systems, and caching appliances.

Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...

Cloudian

This document discusses implementing Hadoop and Elastic MapReduce on Cloudian's scale-out object storage platform. It describes Cloudian's hybrid cloud storage capabilities and how their approach reduces costs and provides faster analytics by analyzing log and event data directly on their storage platform without needing to transform the data for HDFS. Key benefits highlighted include no redundant storage, scaling analytics with storage capacity by adding nodes, and taking advantage of multi-core CPUs for MapReduce tasks.

Move your on prem data to a lake in a Lake in Cloud

CAMMS

With the boom in data; the volume and its complexity, the trend is to move data to the cloud. Where and How do we do this? Azure gives you the answer. In this session, I will give you an introduction to Azure Data Lake and Azure Data Factory, and why they are good for the type of problem we are talking about. You will learn how large datasets can be stored on the cloud, and how you could transport your data to this store. The session will briefly cover Azure Data Lake as the modern warehouse for data on the cloud,

SQL, NoSQL, Distributed SQL: Choose your DataStore carefully

Md Kamaruzzaman

Architecting a datalake

Laurent Leturgez

This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.

Apache Geode Meetup, Cork, Ireland at CIT

Apache Geode

This document provides an introduction to Apache Geode (incubating), including: - A brief history of Geode and why it was developed - An overview of key Geode concepts such as regions, caching, and functions - Examples of interesting large-scale use cases from companies like Indian Railways - A demonstration of using Geode with Apache Spark and Spring XD for a stock prediction application - Information on how to get involved with the Geode open source project community

Spark volume requirements 2018

Rachit Arora

This document discusses storage requirements for running Spark workloads on Kubernetes. It recommends using a distributed file system like HDFS or DBFS for distributed storage and emptyDir or NFS for local temp scratch space. Logs can be stored in emptyDir or pushed to object storage. Features that would improve Spark on Kubernetes include image volumes, flexible PV to PVC mappings, encrypted volumes, and clean deletion for compliance. The document provides an overview of Spark, Kubernetes benefits, and typical Spark deployments.

Drupal performance

Gabi Lee

The document discusses optimizing Drupal performance by measuring performance metrics, implementing caching techniques and modules, optimizing database and application code, and configuring web and application servers. It provides an overview of Sergata and their focus on innovation and startups, and recommends analyzing performance bottlenecks and leveraging caching, CDNs, and server configuration to improve performance.

Storage Requirements and Options for Running Spark on Kubernetes

DataWorks Summit

In a world of serverless computing users tend to be frugal when it comes to expenditure on compute, storage and other resources. Paying for the same when they aren’t in use becomes a significant factor. Offering Spark as service on cloud presents very unique challenges. Running Spark on Kubernetes presents a lot of challenges especially around storage and persistence. Spark workloads have very unique requirements of Storage for intermediate data, long time persistence, Share file system and requirements become very tight when it same need to be offered as a service for enterprise to mange GDPR and other compliance like ISO 27001 and HIPAA certifications. This talk covers challenges involved in providing Serverless Spark Clusters share the specific issues one can encounter when running large Kubernetes clusters in production especially covering the scenarios related to persistence. This talk will help people using Kubernetes or docker runtime in production and help them understand various storage options available and which is more suitable for running Spark workloads on Kubernetes and what more can be done

Keynote oracle days final 16x9 v3.alain

Doina Draganescu

This document summarizes Oracle's strategy and product offerings. Oracle's strategy is to provide products that are complete, open, integrated and best-in-class. It highlights several of Oracle's key products, including the Oracle Database, Oracle Fusion Middleware, Oracle Exadata, Oracle Exalogic, Oracle Applications and Oracle server and storage systems. It notes that Oracle's products hold leading positions in their categories and are optimized to work together for better performance and lower costs.

Teradata Loom Introductory Presentation

mlang222

Teradata Loom is a software that helps users realize the full potential of their Hadoop data lakes. It provides data cataloging, profiling, and lineage tracking to help users find, understand, and prepare their data. Loom's active scanning capabilities automatically discover and profile new data. Its interactive Weaver tool allows self-service data wrangling. Loom is integrated with Hadoop and simplifies data lake management to increase analyst productivity.

CC -Unit4.pptx

Revathiparamanathan

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It provides the freedom to query data at scale using either serverless or dedicated options. Azure HDInsight allows the use of open source frameworks like Hadoop, Spark, Hive, and Kafka for processing large volumes of data. Azure Databricks offers environments for SQL, data science/engineering, and machine learning. The Azure IoT Hub enables scalable IoT solutions by allowing bidirectional communication between IoT applications and connected devices.

Amazon Redshift with Full 360 Inc.

Amazon Web Services

Full 360 is a cloud consulting firm that provides big data, API/UX, and cloud operations services. They helped a customer migrate their data from Netezza to Redshift, building a structured data lake and optimizing queries for equivalent or better performance. Lessons from the project included data standardization, tuning techniques like encoding and sort keys, and creating reusable ingestion processes. The migration reduced license costs and improved operational flexibility.

What are clouds made from

John Garbutt

Clouds are made of on-demand, scalable computing resources that are accessed as a service via the internet. There are different cloud deployment models (public, private, hybrid) and service models (IaaS, PaaS, SaaS). Infrastructure as a service (IaaS) clouds provide fundamental computing resources like storage, networking and virtual machines, while platform as a service (PaaS) clouds provide additional services like databases, messaging queues and development tools. Choosing between IaaS and PaaS involves considering factors like lock-in to the cloud vendor, control over the infrastructure, and application requirements.

Apache Geode Meetup, London

Apache Geode

Apache Geode is an open source in-memory data grid that provides data distribution, replication and high availability. It can be used for caching, messaging and interactive queries. The presentation discusses Geode concepts like cache, region and member. It provides examples of how large companies use Geode for applications requiring real-time response, high concurrency and global data visibility. Geode's performance comes from minimizing data copying and contention through flexible consistency and partitioning. The project is now hosted by Apache and the community is encouraged to get involved through mailing lists, code contributions and example applications.

What is Cloud computing?

Richard Harvey

Cloud computing provides on-demand access to computing resources like storage, networking, and servers that can be rapidly provisioned without long wait times. There are public clouds run by third parties and private clouds within a company's own data center. Public clouds offer elastic resources without large upfront costs but less control, while private clouds offer more control within existing infrastructure limitations. Major cloud providers like Amazon Web Services offer infrastructure as a service (IaaS) like computing and storage, and platform as a service (PaaS) that automates services like databases.

Accelerating Business Intelligence Solutions with Microsoft Azure pass

Jason Strate

Business Intelligence (BI) solutions need to move at the speed of business. Unfortunately, roadblocks related to availability of resources and deployment often present an issue. What if you could accelerate the deployment of an entire BI infrastructure to just a couple hours and start loading data into it by the end of the day. In this session, we'll demonstrate how to leverage Microsoft tools and the Azure cloud environment to build out a BI solution and begin providing analytics to your team with tools such as Power BI. By end of the session, you'll gain an understanding of the capabilities of Azure and how you can start building an end to end BI proof-of-concept today.

Data Science Day New York: Data Science: A Personal History

Cloudera, Inc.

Similar to Building A Self Service Analytics Platform on Hadoop (20)

Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...

Presentation Presentation Presentation Presentation Presentation

Managing storage on Prem and in Cloud

Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...

Move your on prem data to a lake in a Lake in Cloud

SQL, NoSQL, Distributed SQL: Choose your DataStore carefully

Architecting a datalake

Apache Geode Meetup, Cork, Ireland at CIT

Spark volume requirements 2018

Drupal performance

Storage Requirements and Options for Running Spark on Kubernetes

Keynote oracle days final 16x9 v3.alain

Teradata Loom Introductory Presentation

CC -Unit4.pptx

Amazon Redshift with Full 360 Inc.

What are clouds made from

Apache Geode Meetup, London

What is Cloud computing?

Accelerating Business Intelligence Solutions with Microsoft Azure pass

Data Science Day New York: Data Science: A Personal History

Recently uploaded

Monitoring and Managing Anomaly Detection on OpenShift.pdf

Tosin Akinosho

Monitoring and Managing Anomaly Detection on OpenShift Overview Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices. Key Topics Covered 1. Introduction to Anomaly Detection - Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems. 2. Understanding Edge (IoT) - Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source. 3. What is ArgoCD? - Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices. 4. Deployment Using ArgoCD for Edge Devices - Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD. 5. Introduction to Apache Kafka and S3 - Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions. 6. Viewing Kafka Messages in the Data Lake - Learn how to view and analyze Kafka messages stored in a data lake for better insights. 7. What is Prometheus? - Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices. 8. Monitoring Application Metrics with Prometheus - Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system. 9. What is Camel K? - Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes. 10. Configuring Camel K Integrations for Data Pipelines - Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow. 11. What is a Jupyter Notebook? - Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text. 12. Jupyter Notebooks with Code Examples - Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.

Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...

saastr

HCL Notes and Domino License Cost Reduction in the World of DLAU

panagenda

Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/ The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this! We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model. Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward. These topics will be covered - Reducing license cost by finding and fixing misconfigurations and superfluous accounts - How do CCB and CCX licenses really work? - Understanding the DLAU tool and how to best utilize it - Tips for common problem areas, like team mailboxes, functional/test users, etc - Practical examples and best practices to implement right away

Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe

Precisely

Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market. Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.

Energy Efficient Video Encoding for Cloud and Edge Computing Instances

Alpen-Adria-Universität

AppSec PNW: Android and iOS Application Security with MobSF

Ajin Abraham

Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application. This talk covers: Using MobSF for static analysis of mobile applications. Interactive dynamic security assessment of Android and iOS applications. Solving Mobile app CTF challenges. Reverse engineering and runtime analysis of Mobile malware. How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.

GNSS spoofing via SDR (Criptored Talks 2024)

Javier Junquera

In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security. This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing. The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.

Taking AI to the Next Level in Manufacturing.pdf

ssuserfac0301

Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as: 1. How quickly AI is being implemented in manufacturing. 2. Which barriers stand in the way of AI adoption. 3. How data quality and governance form the backbone of AI. 4. Organizational processes and structures that may inhibit effective AI adoption. 6. Ideas and approaches to help build your organization's AI strategy.

Principle of conventional tomography-Bibash Shahi ppt..pptx

BibashShahi

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...

saastr

Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors

DianaGray10

Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more. The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications. We’ll discuss and demo the benefits of UiPath Apps and connectors including: Creating a compelling user experience for any software, without the limitations of APIs. Accelerating the app creation process, saving time and effort Enjoying high-performance CRUD (create, read, update, delete) operations, for seamless data management. Speakers: Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP Charlie Greenberg, host

5th LF Energy Power Grid Model Meet-up Slides

DanBrown980551

5th Power Grid Model Meet-up It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology. Power Grid Model The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services. Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability. Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization. What to expect For the upcoming meetup we are organizing, we have an exciting lineup of activities planned: -Insightful presentations covering two practical applications of the Power Grid Model. -An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024. -An interactive brainstorming session to discuss and propose new feature requests. -An opportunity to connect with fellow Power Grid Model enthusiasts and users.

The Microsoft 365 Migration Tutorial For Beginner.pptx

operationspcvita

Leveraging the Graph for Clinical Trials and Standards

Neo4j

Apps Break Data

Ivo Velitchkov

How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?

Artificial Intelligence and Electronic Warfare

Papadakis K.-Cyber-Information Warfare Analyst & Cyber Defense/Security Consultant-Hellenic MoD

JavaLand 2024: Application Development Green Masterplan

Miro Wengner

Columbus Data & Analytics Wednesdays - June 2024

Jason Packer

Harnessing the Power of NLP and Knowledge Graphs for Opioid Research

Neo4j

Programming Foundation Models with DSPy - Meetup Slides

Zilliz

Recently uploaded (20)

Monitoring and Managing Anomaly Detection on OpenShift.pdf

Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...

HCL Notes and Domino License Cost Reduction in the World of DLAU

Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe

Energy Efficient Video Encoding for Cloud and Edge Computing Instances

AppSec PNW: Android and iOS Application Security with MobSF

GNSS spoofing via SDR (Criptored Talks 2024)

Taking AI to the Next Level in Manufacturing.pdf

Principle of conventional tomography-Bibash Shahi ppt..pptx

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...

Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors

5th LF Energy Power Grid Model Meet-up Slides

The Microsoft 365 Migration Tutorial For Beginner.pptx

Leveraging the Graph for Clinical Trials and Standards

Apps Break Data

Artificial Intelligence and Electronic Warfare

JavaLand 2024: Application Development Green Masterplan

Columbus Data & Analytics Wednesdays - June 2024

Harnessing the Power of NLP and Knowledge Graphs for Opioid Research

Programming Foundation Models with DSPy - Meetup Slides

Building A Self Service Analytics Platform on Hadoop

1. 1Page Building a Self Service Analytics Platform on Hadoop Avinash Ramineni

2. 2Page Clairvoyant

3. 3Page Clairvoyant Services

4. 4Page Quick Poll • Big Data Deployments in Prod • Hadoop Distributions • People use Ecosystems rather than tools • Architecture was implemented on Cloudera • Cloud Experience – AWS ?

5. 5Page Challenges • Data in Silos • Acquires Perspectives as data is moved • Data availability delays • Legacy Systems handling the Volume , Veracity and Velocity • Extracting data from legacy systems • Lack of Self-Service Capabilities • Knowledge becomes tribal – instead of institutional • Security / Compliance Requirements

6. 6Page Data Lake Attributes • Data Democratization • Data Discovery • Data Lineage • Self-Service capabilities • Metadata Management

7. 7Page Without Self-Service

8. 8Page Self-Service at all Levels Ingest Organize Enrich Analyze Dashboards AnalyzeIngest Organize Enrich Insights

9. 9Page Key Design Tenets • Separation of Compute and Storage • Independently scale compute and storage • Data Democratization and Governance • Bring your own Compute (BYOC) • HA / DR • Open Source Stack

10. 1 0 Page Separation of Compute and Storage • Scale storage and compute independently • Shifts bottleneck from Disk IO to Network • Centralized Data Storage • Data Democratization • No data duplication • Easier Hardware upgrade paths • Flexible Architecture • DR Simplified

11. 1 1 Page BYOC (Bring Your Own Cluster) • Each department/application can bring its own Hadoop cluster • Eliminates the need for very large clusters • Easier to administer and maintain • Reduces multi-tenancy issues • Clusters can be upgraded independently • Enables usage based cost model Centralized / Common S3 Storage Marketing Cluster Centralized Storage Personalization Cluster Main Cluster

12. 1 2 Page Architecture

13. 1 3 Page Architecture – Data Ingestion Layer • DB Ingestor • Stream Ingestor • Kafka and Spark Streaming • File Ingestor • FTP / SFTP / Logs • Ingestion using Service API

14. 1 4 Page Architecture – Data Processing Layer • Storage layer carved into logical buckets • Landing, Raw, Derived and Delivery • Schema stored with data (no guesswork) • Platform Jobs • Converting text to Parquet • Saving streaming data Parquet • Derivatives • Compaction • Standardization

15. 1 5 Page Architecture – Data Delivery Layer • Data Delivery • SQL - Spark Thrift Server / Impala • Tableau, SQL IDE, Applications • Self Service • Derivatives • Represented Via SQL on Delivery Layer • Stored in Derived Storage Layer • Metadata driven • Derived Layer Generators • Long running Spark Job • Derivative Refresh

16. 1 6 Page Key Takeaways - Cloud • Hadoop Cloud ready-ness • Cloudera Director Limitations • Multi-Availability zone, regions • Storage • Instance Storage • EBS Volumes • gp2 vs st1 • S3 Eventual Consistency

17. 1 7 Page Key Takeaways - Spark Thrift Server • Spark Thrift Server Support • Performance Tuning • Concurrency • partition strategy • Cache Tables • Compression Codec for Parquet • Snappy vs gzip

18. 1 8 Page Key Takeaways - Security • Secure by Design, Secure by Default • Access to Data on S3 • IAM Roles • Sentry • Support for Spark • Kerberos • Spark Thrift Server • Navigator • Support for Spark

19. 1 9 Page Key Takeaways - General • Rapidly Changing Technology • Feature addition • Documentation • Bugs • Jar hell • Small files • Performance Issues • Compaction

20. 2 0 Page Key Takeaways - General • Partition Strategy • Parquet Files • Balancing parallelism and throughput • Table Partitions • Cluster sizing, optimization and tuning • Integrating with Corporate infrastructure • Deployment practices • Monitoring and Alerting • Information Security Policies

21. 2 1 Page Data Security

22. 2 2 Page Questions • Principal @ Clairvoyant • Email: avinash@clairvoyantsoft.com • LinkedIn: https://www.linkedin.com/in/avinashramineni

Building A Self Service Analytics Platform on Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to Building A Self Service Analytics Platform on Hadoop

Similar to Building A Self Service Analytics Platform on Hadoop (20)

Recently uploaded

Recently uploaded (20)

Building A Self Service Analytics Platform on Hadoop