These slides were presented by Avinash Ramineni of Clairvoyant to the Atlanta Apache Spark User Group on Wednesday, March 22, 2017: https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/238109721/
Accelerating Insight - Smart Data Lake Customer Success StoriesCambridge Semantics
At Gartner Data & Analytics Summit 2017 Alok Prasad, President, was joined by Peter Horowitz of PricewaterhouseCoopers in presenting a session on how Cambridge Semantics' in-memory, massively parallel, semantic graph-based platform delivers an accelerating edge to data-driven organizations, while maintaining trust with security and governance.
This document provides an overview of Anzo Unstructured, a natural language processing (NLP) platform from Cambridge Semantics. It discusses the core capabilities of Anzo Unstructured, including intake of various file formats, extraction of entities and relationships, and semantic analysis. It also outlines example use cases in pharma and finance. The document demonstrates the configuration and visualization of Anzo Unstructured pipelines and annotations.
This document provides an overview and introduction to Cambridge Semantics Inc. and their Anzo Smart Data Platform for building smart data lakes using semantics. Key points include:
- Cambridge Semantics was founded in 2007 and their Anzo software suite uses open semantic web standards to create data analytics and management solutions from diverse data sources.
- While data lakes make it easy to assemble large volumes of data, identifying and linking data across sources remains challenging without harmonization of meanings. Semantic models and tools can help address these issues.
- The Anzo Analytics and Data Integration Suite uses business understandable semantic models to describe, search, query and analyze data from various structured and unstructured sources to build a smart data lake.
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®Cambridge Semantics
The document discusses how Anzo Smart Data Lake can help government agencies transform data management and increase time to insight. It provides an overview of Anzo and how it uses semantic knowledge graphs to link and harmonize diverse data sources for self-service data preparation, discovery, and analytics. Examples are given of how Anzo has helped organizations in intelligence and defense integrate data sources and gain better visibility into areas like contract performance. The presentation concludes by discussing how Anzo could help agencies drive business efficiency, enable more self-service for citizens using public data, and suggests next steps of proof of concept or proposal.
The document discusses how traditional analytics approaches are no longer sufficient due to new data sources like machine data that are unstructured and from external sources. It introduces Splunk as a platform that can collect, index, and analyze massive amounts of machine data in real-time to provide operational intelligence and business insights. Splunk uses late binding schema to allow ad-hoc queries over heterogeneous machine data without needing to design schemas upfront. It can complement traditional BI tools by focusing on real-time analytics over machine data while traditional tools focus on structured data.
Using Cloud Automation Technologies to Deliver an Enterprise Data FabricCambridge Semantics
The world of database management is changing. Cloud adoption is accelerating, offering a path for companies to increase their database capabilities while keeping costs in line. To help IT decision-makers survive and thrive in the cloud era, DBTA hosted this special roundtable webinar.
Graph technology has truly burst onto the scene with diverse new products and services, proving that graph is relevant and that not all graph use cases are equal. Previously relegated to niche implementations and science projects, graph now finds itself deployed as the foundational technology for enterprise analytics solutions and enterprise Data Fabric strategies. It is no surprise that many are calling 2018 “The Year of the Graph”.
This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.
Accelerating Insight - Smart Data Lake Customer Success StoriesCambridge Semantics
At Gartner Data & Analytics Summit 2017 Alok Prasad, President, was joined by Peter Horowitz of PricewaterhouseCoopers in presenting a session on how Cambridge Semantics' in-memory, massively parallel, semantic graph-based platform delivers an accelerating edge to data-driven organizations, while maintaining trust with security and governance.
This document provides an overview of Anzo Unstructured, a natural language processing (NLP) platform from Cambridge Semantics. It discusses the core capabilities of Anzo Unstructured, including intake of various file formats, extraction of entities and relationships, and semantic analysis. It also outlines example use cases in pharma and finance. The document demonstrates the configuration and visualization of Anzo Unstructured pipelines and annotations.
This document provides an overview and introduction to Cambridge Semantics Inc. and their Anzo Smart Data Platform for building smart data lakes using semantics. Key points include:
- Cambridge Semantics was founded in 2007 and their Anzo software suite uses open semantic web standards to create data analytics and management solutions from diverse data sources.
- While data lakes make it easy to assemble large volumes of data, identifying and linking data across sources remains challenging without harmonization of meanings. Semantic models and tools can help address these issues.
- The Anzo Analytics and Data Integration Suite uses business understandable semantic models to describe, search, query and analyze data from various structured and unstructured sources to build a smart data lake.
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®Cambridge Semantics
The document discusses how Anzo Smart Data Lake can help government agencies transform data management and increase time to insight. It provides an overview of Anzo and how it uses semantic knowledge graphs to link and harmonize diverse data sources for self-service data preparation, discovery, and analytics. Examples are given of how Anzo has helped organizations in intelligence and defense integrate data sources and gain better visibility into areas like contract performance. The presentation concludes by discussing how Anzo could help agencies drive business efficiency, enable more self-service for citizens using public data, and suggests next steps of proof of concept or proposal.
The document discusses how traditional analytics approaches are no longer sufficient due to new data sources like machine data that are unstructured and from external sources. It introduces Splunk as a platform that can collect, index, and analyze massive amounts of machine data in real-time to provide operational intelligence and business insights. Splunk uses late binding schema to allow ad-hoc queries over heterogeneous machine data without needing to design schemas upfront. It can complement traditional BI tools by focusing on real-time analytics over machine data while traditional tools focus on structured data.
Using Cloud Automation Technologies to Deliver an Enterprise Data FabricCambridge Semantics
The world of database management is changing. Cloud adoption is accelerating, offering a path for companies to increase their database capabilities while keeping costs in line. To help IT decision-makers survive and thrive in the cloud era, DBTA hosted this special roundtable webinar.
Graph technology has truly burst onto the scene with diverse new products and services, proving that graph is relevant and that not all graph use cases are equal. Previously relegated to niche implementations and science projects, graph now finds itself deployed as the foundational technology for enterprise analytics solutions and enterprise Data Fabric strategies. It is no surprise that many are calling 2018 “The Year of the Graph”.
This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.
Using a Semantic and Graph-based Data Catalog in a Modern Data FabricCambridge Semantics
Watch this webinar to learn about the benefits of using semantic and graph database technology to create a Data Catalog of all of an enterprise's data, regardless of source or format, as part of a modern IT or data management stack and an important step toward building an Enterprise Data Fabric.
When it comes to creating an enterprise AI strategy: if your company isn’t good at analytics, it’s not ready for AI. Succeeding in AI requires being good at data engineering AND analytics. Unfortunately, management teams often assume they can leapfrog best practices for basic data analytics by directly adopting advanced technologies such as ML/AI – setting themselves up for failure from the get-go. This presentation explains how to get basic data engineering and the right technology in place to create and maintain data pipelines so that you can solve problems with AI successfully.
Retail banks are moving beyond the data warehouse and data lake and are now implementing data fabric architectures to address data discovery and integration challenges.
These are the slides from our webinar "Modern Data Discovery and Integration in Retail Banking" in which we explore the role of the data discovery and integration layer in a data fabric with special focus on evolution from data warehouse to data fabric, semantics and graph data models in data fabric and example use cases in retail banks and B2C financial services.
Necessity of Data Lakes in the Financial Services SectorDataWorks Summit
With the emergence of regulations such as the General Data Protection Regulation from the European Union (effective May 2018), with fines up to 20m Euro, Data Lakes are emerging as the data architecture of choice amongst financial institutions. Banks are embarking on a journey to enable data scientists to unlock the value of the data silo'ed in many disparate data systems. By enabling self service data access and merging multiple streams of data by using data clustering, entity extraction, identity resolution and other techniques - we will show how banks have used Analytics to uncover business value without falling into the abyss of data swamps. The build out of the data lake requires the ingestion of data from multiple operational systems . By leveraging an automated Data Cataloging service, organizations are able to search, profile, discover, tag, track lineage and capture tribal knowledge delivered on the FICO Analytics Cloud enabling the data scientists to build innovative models, make automated decisions, track fraudulent usage, make intelligent marketing campaigns and improve the top line and bottom line for the financial institution.
Speaker:
Rohit Valia, Product Management and Strategy, Fico
Accelerate Digital Transformation with an Enterprise Big Data FabricCambridge Semantics
In this webinar by Cambridge Semantics' VP of Solution Engineering, Ben Szekely, you will learn more about how the Enterprise Data Fabric prevails as the bedrock of enterprise digital strategy. Connected and highly available data is the new normal - powering analytics and AI. The data lake itself is commoditized, like raw compute or disk, and becomes an unseen part of the stack. Semantic graph technology is central to Data Fabric initiatives that meaningfully contribute to digital transformation.
We share our vision for digital innovation - a shift to something powerful, expedient and future-proof. The Data Fabric connects enterprise data for unprecedented access in an overlay fashion that does not disrupt current investments. Interconnected and reliable data drives business outcomes by automating scalable AI and ML efforts. Graph technology is the way forward to realize this future.
The Convergence of Data & Digital: Mapping Out a Cohesive Strategy for Maximu...Remy Rosenbaum
Slides from Joe Caserta's Keynote at MIT CDOIQ Symposium 2018
As we continue to shift into a data-driven digital society, it’s crucial to ensure a cohesive strategy
between the chief data officer and chief digital officer. In this talk, Joe Caserta will discuss the
convergence between data and digital, addressing the interdependencies, ambiguities, and
complications between the two. Joe will outline a cohesive strategy to enhance enterprise operations
and improve your bottom line.
Solution architecture for big data projects
solution architecture,big data,hadoop,hive,hbase,impala,spark,apache,cassandra,SAP HANA,Cognos big insights
Sustainability Investment Research Using Cognitive AnalyticsCambridge Semantics
In this webinar Anthony J. Sarkis, Chief Strategy Officer at Parabole, and Steve Sarsfield, VP Product at Cambridge Semantics, explore how portfolio managers are using the recently developed Parabole/ AnzoGraph DB integration as their underlying infrastructure for conducting ML and cognitive analytics at scale to exploit data to identify potential risks and new opportunities.
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESBDenodo
Data integration is paramount, in this presentation you will find three different paradigms: using client-side tools, creating traditional data warehouses and the data virtualization solution - the logical data warehouse, comparing each other and positioning data virtualization as an integral part of any future-proof IT infrastructure.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/1q94Ka.
Data mining and data warehousing have evolved since the 1960s due to increases in data collection and storage. Data mining automates the extraction of patterns and knowledge from large databases. It uses predictive and descriptive models like classification, clustering, and association rule mining. The data mining process involves problem definition, data preparation, model building, evaluation, and deployment. Data warehouses integrate data from multiple sources for analysis and decision making. They are large, subject-oriented databases designed for querying and analysis rather than transactions. Data warehousing addresses the need to consolidate organizational data spread across various locations and systems.
The data services marketplace is enabled by a data abstraction layer that supports rapid development of operational applications and single data view portals. In this presentation yo will learn services-based reference architecture, modality, and latency of data access.
- Reference architecture for enterprise data services marketplace
- Modality and latency of data access
- Customer use cases and demo
This presentation is part of the Denodo Educational Seminar , and you can watch the video here goo.gl/vycYmZ.
Why Data Virtualization? An Introduction by DenodoJusto Hidalgo
Data Virtualization means Real-time Data Access and Integration. But why do I need it? This presentation tries to answer it in a simple yet clear way.
By Alberto Pan, CTO of Denodo, and Justo Hidalgo, VP Product Management.
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningCambridge Semantics
This EDM Council webinar, sponsored by Cambridge Semantics Inc. and featuring FI Consulting, explores the challenges common to a risk analytics pipeline, application of graph analytics to mortgage loan data and use cases in adjacent areas including customer service, collections, fraud and AML.
Red Hat's document discusses using JBoss Data Virtualization to gain better insights from big data. It describes challenges with existing data integration approaches as data sources grow in size, type and location. Red Hat's big data strategy is to reduce the information gap by making all data easily consumable for analytics. JBoss Data Virtualization software virtually unifies data across sources and exposes it to applications through standard interfaces. The demonstration shows integrating social media sentiment data from Hadoop with sales data from MySQL to analyze movie ticket and merchandise sales.
Cortana Analytics Workshop: Azure Data CatalogMSAdvAnalytics
Julie Strauss. This session introduces the newest services in the Cortana Analytics family. The Azure Data Catalog is an enterprise-wide metadata catalog that enables self-service data source discovery. Data Catalog is a fully managed service that stores, describes, indexes, and provides information on how to access any registered data source in your organization. This session presents an overview of the Data Catalog and how – by using it to register, enrich, discover, understand and consume data sources – you can close the gap between those seeking information and those creating it.
ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML MeetupRomain Yon
Original event: https://www.meetup.com/NYC-Machine-Learning/events/256605862/
--
"Doing large scale ML in production is hard" – Everyone who's tried
This talk is focussed on ML Systems. Especially the less obvious pitfalls, which have caused us troubles at Spotify.
This talk assumes a certain level of familiarity with ML: You'll get the most out of if you've some experience with applied ML, ideally on production systems.
Romain Yon is a Staff ML Engineer at Spotify. Over the years, Romain has worked on many of the core ML systems that power Spotify today (Music Recommendation, Catalog Quality, Search Ranking, Ads, ..).
During the past year, Romain has been mostly focusing on designing reusable ML Infrastructure that can be leveraged throughout Spotify.
Prior to Spotify, Romain co-founded the startup https://linkurio.us while getting his MSc in ML from Georgia Tech.
Graph-driven Data Integration: Accelerating and Automating Data Delivery for ...Cambridge Semantics
In our webinar "A Data Fabric Market Update with Guest Speaker, VP, Principal Analyst Noel Yuhanna" Ben Szekely, Cambridge Semantics’ Co-founder and SVP of Field Operations, and guest speaker, Noel Yuhanna, VP and Principal Analyst at Forrester and author of the “The Forrester Wave™: Enterprise Data Fabric, Q2 2020”, discuss the state of the Data Fabric Market. These are Ben's slides from that webinar.
Supporting Data Services Marketplace using Data VirtualizationDenodo
The document discusses an Enterprise Data Marketplace that would serve as a centralized repository for reusable data assets. It would allow all internal and external data sources to be unified and accessed through a single portal. This marketplace would standardize data access, reduce redundant data retrieval, and provide benefits like governance of data services and an abstraction layer to reduce direct access to source systems. Screenshots are provided of the marketplace's potential capabilities like searching for data assets, a data dictionary, and shopping cart functionality.
Denodo Data Virtualization - IT Days in Luxembourg with OktopusDenodo
1) Denodo provides a data virtualization platform that connects disparate data sources and allows users to access and analyze enterprise data without moving or replicating it.
2) Customers like Bank of the West, Intel, and Asurion saw improvements like faster time to market, increased agility, and cost savings by using Denodo to replace ETL processes and create a single access layer for all their data.
3) Denodo's platform provides capabilities for data abstraction, zero replication, performance optimization, data governance, and deployment in multiple locations.
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...Cambridge Semantics
Thomas Cook, director of sales, Cambridge Semantics, offers a primer on graph database technology and the rapid growth of knowledge graphs at Data Summit 2020 in his presentation titled "AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World".
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Data Con LA
The document discusses how an Enterprise Data Lake (EDL) provides a more effective solution for enterprise BI and analytics compared to traditional enterprise data warehouses (EDW). It argues that EDL allows enterprises to retain all datasets, service ad-hoc requests with no latency or development time, and offer a low-cost, low-maintenance solution that supports direct analytics and reporting on data stored in its native format. The document promotes EDL as a mainstream solution that should be part of every mid-sized and large enterprise's standard IT stack.
Using a Semantic and Graph-based Data Catalog in a Modern Data FabricCambridge Semantics
Watch this webinar to learn about the benefits of using semantic and graph database technology to create a Data Catalog of all of an enterprise's data, regardless of source or format, as part of a modern IT or data management stack and an important step toward building an Enterprise Data Fabric.
When it comes to creating an enterprise AI strategy: if your company isn’t good at analytics, it’s not ready for AI. Succeeding in AI requires being good at data engineering AND analytics. Unfortunately, management teams often assume they can leapfrog best practices for basic data analytics by directly adopting advanced technologies such as ML/AI – setting themselves up for failure from the get-go. This presentation explains how to get basic data engineering and the right technology in place to create and maintain data pipelines so that you can solve problems with AI successfully.
Retail banks are moving beyond the data warehouse and data lake and are now implementing data fabric architectures to address data discovery and integration challenges.
These are the slides from our webinar "Modern Data Discovery and Integration in Retail Banking" in which we explore the role of the data discovery and integration layer in a data fabric with special focus on evolution from data warehouse to data fabric, semantics and graph data models in data fabric and example use cases in retail banks and B2C financial services.
Necessity of Data Lakes in the Financial Services SectorDataWorks Summit
With the emergence of regulations such as the General Data Protection Regulation from the European Union (effective May 2018), with fines up to 20m Euro, Data Lakes are emerging as the data architecture of choice amongst financial institutions. Banks are embarking on a journey to enable data scientists to unlock the value of the data silo'ed in many disparate data systems. By enabling self service data access and merging multiple streams of data by using data clustering, entity extraction, identity resolution and other techniques - we will show how banks have used Analytics to uncover business value without falling into the abyss of data swamps. The build out of the data lake requires the ingestion of data from multiple operational systems . By leveraging an automated Data Cataloging service, organizations are able to search, profile, discover, tag, track lineage and capture tribal knowledge delivered on the FICO Analytics Cloud enabling the data scientists to build innovative models, make automated decisions, track fraudulent usage, make intelligent marketing campaigns and improve the top line and bottom line for the financial institution.
Speaker:
Rohit Valia, Product Management and Strategy, Fico
Accelerate Digital Transformation with an Enterprise Big Data FabricCambridge Semantics
In this webinar by Cambridge Semantics' VP of Solution Engineering, Ben Szekely, you will learn more about how the Enterprise Data Fabric prevails as the bedrock of enterprise digital strategy. Connected and highly available data is the new normal - powering analytics and AI. The data lake itself is commoditized, like raw compute or disk, and becomes an unseen part of the stack. Semantic graph technology is central to Data Fabric initiatives that meaningfully contribute to digital transformation.
We share our vision for digital innovation - a shift to something powerful, expedient and future-proof. The Data Fabric connects enterprise data for unprecedented access in an overlay fashion that does not disrupt current investments. Interconnected and reliable data drives business outcomes by automating scalable AI and ML efforts. Graph technology is the way forward to realize this future.
The Convergence of Data & Digital: Mapping Out a Cohesive Strategy for Maximu...Remy Rosenbaum
Slides from Joe Caserta's Keynote at MIT CDOIQ Symposium 2018
As we continue to shift into a data-driven digital society, it’s crucial to ensure a cohesive strategy
between the chief data officer and chief digital officer. In this talk, Joe Caserta will discuss the
convergence between data and digital, addressing the interdependencies, ambiguities, and
complications between the two. Joe will outline a cohesive strategy to enhance enterprise operations
and improve your bottom line.
Solution architecture for big data projects
solution architecture,big data,hadoop,hive,hbase,impala,spark,apache,cassandra,SAP HANA,Cognos big insights
Sustainability Investment Research Using Cognitive AnalyticsCambridge Semantics
In this webinar Anthony J. Sarkis, Chief Strategy Officer at Parabole, and Steve Sarsfield, VP Product at Cambridge Semantics, explore how portfolio managers are using the recently developed Parabole/ AnzoGraph DB integration as their underlying infrastructure for conducting ML and cognitive analytics at scale to exploit data to identify potential risks and new opportunities.
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESBDenodo
Data integration is paramount, in this presentation you will find three different paradigms: using client-side tools, creating traditional data warehouses and the data virtualization solution - the logical data warehouse, comparing each other and positioning data virtualization as an integral part of any future-proof IT infrastructure.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/1q94Ka.
Data mining and data warehousing have evolved since the 1960s due to increases in data collection and storage. Data mining automates the extraction of patterns and knowledge from large databases. It uses predictive and descriptive models like classification, clustering, and association rule mining. The data mining process involves problem definition, data preparation, model building, evaluation, and deployment. Data warehouses integrate data from multiple sources for analysis and decision making. They are large, subject-oriented databases designed for querying and analysis rather than transactions. Data warehousing addresses the need to consolidate organizational data spread across various locations and systems.
The data services marketplace is enabled by a data abstraction layer that supports rapid development of operational applications and single data view portals. In this presentation yo will learn services-based reference architecture, modality, and latency of data access.
- Reference architecture for enterprise data services marketplace
- Modality and latency of data access
- Customer use cases and demo
This presentation is part of the Denodo Educational Seminar , and you can watch the video here goo.gl/vycYmZ.
Why Data Virtualization? An Introduction by DenodoJusto Hidalgo
Data Virtualization means Real-time Data Access and Integration. But why do I need it? This presentation tries to answer it in a simple yet clear way.
By Alberto Pan, CTO of Denodo, and Justo Hidalgo, VP Product Management.
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningCambridge Semantics
This EDM Council webinar, sponsored by Cambridge Semantics Inc. and featuring FI Consulting, explores the challenges common to a risk analytics pipeline, application of graph analytics to mortgage loan data and use cases in adjacent areas including customer service, collections, fraud and AML.
Red Hat's document discusses using JBoss Data Virtualization to gain better insights from big data. It describes challenges with existing data integration approaches as data sources grow in size, type and location. Red Hat's big data strategy is to reduce the information gap by making all data easily consumable for analytics. JBoss Data Virtualization software virtually unifies data across sources and exposes it to applications through standard interfaces. The demonstration shows integrating social media sentiment data from Hadoop with sales data from MySQL to analyze movie ticket and merchandise sales.
Cortana Analytics Workshop: Azure Data CatalogMSAdvAnalytics
Julie Strauss. This session introduces the newest services in the Cortana Analytics family. The Azure Data Catalog is an enterprise-wide metadata catalog that enables self-service data source discovery. Data Catalog is a fully managed service that stores, describes, indexes, and provides information on how to access any registered data source in your organization. This session presents an overview of the Data Catalog and how – by using it to register, enrich, discover, understand and consume data sources – you can close the gap between those seeking information and those creating it.
ML Infra @ Spotify: Lessons Learned - Romain Yon - NYC ML MeetupRomain Yon
Original event: https://www.meetup.com/NYC-Machine-Learning/events/256605862/
--
"Doing large scale ML in production is hard" – Everyone who's tried
This talk is focussed on ML Systems. Especially the less obvious pitfalls, which have caused us troubles at Spotify.
This talk assumes a certain level of familiarity with ML: You'll get the most out of if you've some experience with applied ML, ideally on production systems.
Romain Yon is a Staff ML Engineer at Spotify. Over the years, Romain has worked on many of the core ML systems that power Spotify today (Music Recommendation, Catalog Quality, Search Ranking, Ads, ..).
During the past year, Romain has been mostly focusing on designing reusable ML Infrastructure that can be leveraged throughout Spotify.
Prior to Spotify, Romain co-founded the startup https://linkurio.us while getting his MSc in ML from Georgia Tech.
Graph-driven Data Integration: Accelerating and Automating Data Delivery for ...Cambridge Semantics
In our webinar "A Data Fabric Market Update with Guest Speaker, VP, Principal Analyst Noel Yuhanna" Ben Szekely, Cambridge Semantics’ Co-founder and SVP of Field Operations, and guest speaker, Noel Yuhanna, VP and Principal Analyst at Forrester and author of the “The Forrester Wave™: Enterprise Data Fabric, Q2 2020”, discuss the state of the Data Fabric Market. These are Ben's slides from that webinar.
Supporting Data Services Marketplace using Data VirtualizationDenodo
The document discusses an Enterprise Data Marketplace that would serve as a centralized repository for reusable data assets. It would allow all internal and external data sources to be unified and accessed through a single portal. This marketplace would standardize data access, reduce redundant data retrieval, and provide benefits like governance of data services and an abstraction layer to reduce direct access to source systems. Screenshots are provided of the marketplace's potential capabilities like searching for data assets, a data dictionary, and shopping cart functionality.
Denodo Data Virtualization - IT Days in Luxembourg with OktopusDenodo
1) Denodo provides a data virtualization platform that connects disparate data sources and allows users to access and analyze enterprise data without moving or replicating it.
2) Customers like Bank of the West, Intel, and Asurion saw improvements like faster time to market, increased agility, and cost savings by using Denodo to replace ETL processes and create a single access layer for all their data.
3) Denodo's platform provides capabilities for data abstraction, zero replication, performance optimization, data governance, and deployment in multiple locations.
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...Cambridge Semantics
Thomas Cook, director of sales, Cambridge Semantics, offers a primer on graph database technology and the rapid growth of knowledge graphs at Data Summit 2020 in his presentation titled "AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World".
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Data Con LA
The document discusses how an Enterprise Data Lake (EDL) provides a more effective solution for enterprise BI and analytics compared to traditional enterprise data warehouses (EDW). It argues that EDL allows enterprises to retain all datasets, service ad-hoc requests with no latency or development time, and offer a low-cost, low-maintenance solution that supports direct analytics and reporting on data stored in its native format. The document promotes EDL as a mainstream solution that should be part of every mid-sized and large enterprise's standard IT stack.
Building enterprise advance analytics platformHaoran Du
Raymond Fu gave a presentation on building an enterprise analytics platform at the SoCal Data Science Conference. He has over 16 years of experience in big data, business intelligence, and enterprise architecture. He discussed how big data disrupts traditional architecture and requires new skills. Advanced analytics involves creating predictive models through machine learning to enable strategic and operational decisions. An enterprise analytics strategy involves data management, modernizing data platforms, and operationalizing advanced analytics models. Fu outlined the key capabilities needed for data management, analytics creation, and analytics operationalization. He provided examples of reference architectures and services that can be used to build an enterprise analytics platform.
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsCloudera, Inc.
The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices including the identification of dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices actually can be implemented in Hadoop.
Data Warehouse Design and Best PracticesIvo Andreev
A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to scalable, balanced and flexible architecture that is capable to meet both present and long-term future needs. This session covers a comparison of the main data warehouse architectures together with best practices for the logical and physical design that support staging, load and querying.
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...NICSA
With the proliferation of Big Data-oriented technology and its accompanying applications of advanced statistical techniques, asset managers are enabling their sales and marketing teams with more insight into the preferences and proclivities of their clients, both advisors and investors. This webinar will give attendees a general understanding of Big Data’s technologies and techniques especially as they pertain to using predictive analytics for more effective and targeted marketing and distribution.
Desired Outcomes:
Understanding Big Data and how it is enabling adopters to use data more effectively than in the past
Familiarity with some of the technological and analytical approaches Big Data enables
Understanding of attribution models for measuring advisor and investor responsiveness
Knowledge of how to prioritize campaigns and contacts by combining measures of valuation and responsiveness
Grasp of some of the more effective way to adopt predictive analysis for sales and marketing
Understanding basics of recommender systems and how next best action is determined
Big data architectures and the data lakeJames Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...Avinash Ramineni
Enterprises have been rapidly adopting data lakes as a complement or replacement of data warehouses. Many of the Data lake implementations are ignoring the inherent drawbacks and limitations of Data Lakes and ending up as data swamps with little or no benefit to the businesses. In this session we will go through some of challenges and the key aspects that need to be considered for successful Data lake implementations.
This document discusses managing storage across public and private resources. It covers the evolution of on-site storage management, storage options in the public cloud, and challenges of managing hybrid cloud storage. Key topics include the transition from siloed storage to software-defined storage, various cloud storage services like object storage and block storage, challenges of public cloud limitations, and solutions for connecting on-site and cloud storage like gateways, file systems, and caching appliances.
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Cloudian
This document discusses implementing Hadoop and Elastic MapReduce on Cloudian's scale-out object storage platform. It describes Cloudian's hybrid cloud storage capabilities and how their approach reduces costs and provides faster analytics by analyzing log and event data directly on their storage platform without needing to transform the data for HDFS. Key benefits highlighted include no redundant storage, scaling analytics with storage capacity by adding nodes, and taking advantage of multi-core CPUs for MapReduce tasks.
Move your on prem data to a lake in a Lake in CloudCAMMS
With the boom in data; the volume and its complexity, the trend is to move data to the cloud. Where and How do we do this? Azure gives you the answer. In this session, I will give you an introduction to Azure Data Lake and Azure Data Factory, and why they are good for the type of problem we are talking about. You will learn how large datasets can be stored on the cloud, and how you could transport your data to this store. The session will briefly cover Azure Data Lake as the modern warehouse for data on the cloud,
SQL, NoSQL, Distributed SQL: Choose your DataStore carefullyMd Kamaruzzaman
In modern Software Development and Software Architecture, selecting the right DataStore is one of the most challenging and important task. In this presentation, I have summarized the major DataStores and the decision criteria to select the right DataStore according to the use case.
This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.
Apache Geode Meetup, Cork, Ireland at CITApache Geode
This document provides an introduction to Apache Geode (incubating), including:
- A brief history of Geode and why it was developed
- An overview of key Geode concepts such as regions, caching, and functions
- Examples of interesting large-scale use cases from companies like Indian Railways
- A demonstration of using Geode with Apache Spark and Spring XD for a stock prediction application
- Information on how to get involved with the Geode open source project community
This document discusses storage requirements for running Spark workloads on Kubernetes. It recommends using a distributed file system like HDFS or DBFS for distributed storage and emptyDir or NFS for local temp scratch space. Logs can be stored in emptyDir or pushed to object storage. Features that would improve Spark on Kubernetes include image volumes, flexible PV to PVC mappings, encrypted volumes, and clean deletion for compliance. The document provides an overview of Spark, Kubernetes benefits, and typical Spark deployments.
The document discusses optimizing Drupal performance by measuring performance metrics, implementing caching techniques and modules, optimizing database and application code, and configuring web and application servers. It provides an overview of Sergata and their focus on innovation and startups, and recommends analyzing performance bottlenecks and leveraging caching, CDNs, and server configuration to improve performance.
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
In a world of serverless computing users tend to be frugal when it comes to expenditure on compute, storage and other resources. Paying for the same when they aren’t in use becomes a significant factor. Offering Spark as service on cloud presents very unique challenges. Running Spark on Kubernetes presents a lot of challenges especially around storage and persistence. Spark workloads have very unique requirements of Storage for intermediate data, long time persistence, Share file system and requirements become very tight when it same need to be offered as a service for enterprise to mange GDPR and other compliance like ISO 27001 and HIPAA certifications.
This talk covers challenges involved in providing Serverless Spark Clusters share the specific issues one can encounter when running large Kubernetes clusters in production especially covering the scenarios related to persistence.
This talk will help people using Kubernetes or docker runtime in production and help them understand various storage options available and which is more suitable for running Spark workloads on Kubernetes and what more can be done
This document summarizes Oracle's strategy and product offerings. Oracle's strategy is to provide products that are complete, open, integrated and best-in-class. It highlights several of Oracle's key products, including the Oracle Database, Oracle Fusion Middleware, Oracle Exadata, Oracle Exalogic, Oracle Applications and Oracle server and storage systems. It notes that Oracle's products hold leading positions in their categories and are optimized to work together for better performance and lower costs.
Teradata Loom is a software that helps users realize the full potential of their Hadoop data lakes. It provides data cataloging, profiling, and lineage tracking to help users find, understand, and prepare their data. Loom's active scanning capabilities automatically discover and profile new data. Its interactive Weaver tool allows self-service data wrangling. Loom is integrated with Hadoop and simplifies data lake management to increase analyst productivity.
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It provides the freedom to query data at scale using either serverless or dedicated options. Azure HDInsight allows the use of open source frameworks like Hadoop, Spark, Hive, and Kafka for processing large volumes of data. Azure Databricks offers environments for SQL, data science/engineering, and machine learning. The Azure IoT Hub enables scalable IoT solutions by allowing bidirectional communication between IoT applications and connected devices.
Full 360 is a cloud consulting firm that provides big data, API/UX, and cloud operations services. They helped a customer migrate their data from Netezza to Redshift, building a structured data lake and optimizing queries for equivalent or better performance. Lessons from the project included data standardization, tuning techniques like encoding and sort keys, and creating reusable ingestion processes. The migration reduced license costs and improved operational flexibility.
Clouds are made of on-demand, scalable computing resources that are accessed as a service via the internet. There are different cloud deployment models (public, private, hybrid) and service models (IaaS, PaaS, SaaS). Infrastructure as a service (IaaS) clouds provide fundamental computing resources like storage, networking and virtual machines, while platform as a service (PaaS) clouds provide additional services like databases, messaging queues and development tools. Choosing between IaaS and PaaS involves considering factors like lock-in to the cloud vendor, control over the infrastructure, and application requirements.
Apache Geode is an open source in-memory data grid that provides data distribution, replication and high availability. It can be used for caching, messaging and interactive queries. The presentation discusses Geode concepts like cache, region and member. It provides examples of how large companies use Geode for applications requiring real-time response, high concurrency and global data visibility. Geode's performance comes from minimizing data copying and contention through flexible consistency and partitioning. The project is now hosted by Apache and the community is encouraged to get involved through mailing lists, code contributions and example applications.
Cloud computing provides on-demand access to computing resources like storage, networking, and servers that can be rapidly provisioned without long wait times. There are public clouds run by third parties and private clouds within a company's own data center. Public clouds offer elastic resources without large upfront costs but less control, while private clouds offer more control within existing infrastructure limitations. Major cloud providers like Amazon Web Services offer infrastructure as a service (IaaS) like computing and storage, and platform as a service (PaaS) that automates services like databases.
Accelerating Business Intelligence Solutions with Microsoft Azure passJason Strate
Business Intelligence (BI) solutions need to move at the speed of business. Unfortunately, roadblocks related to availability of resources and deployment often present an issue. What if you could accelerate the deployment of an entire BI infrastructure to just a couple hours and start loading data into it by the end of the day. In this session, we'll demonstrate how to leverage Microsoft tools and the Azure cloud environment to build out a BI solution and begin providing analytics to your team with tools such as Power BI. By end of the session, you'll gain an understanding of the capabilities of Azure and how you can start building an end to end BI proof-of-concept today.
Data Science Day New York: Data Science: A Personal HistoryCloudera, Inc.
Understand the path Jeff Hammerbacher from Facebook and building scalable systems on Hadoop to Co-founding Cloudera and building an organization that provides the leading Hadoop platform.
Similar to Building A Self Service Analytics Platform on Hadoop (20)
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
The Microsoft 365 Migration Tutorial For Beginner.pptxoperationspcvita
This presentation will help you understand the power of Microsoft 365. However, we have mentioned every productivity app included in Office 365. Additionally, we have suggested the migration situation related to Office 365 and how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
4. 4Page
Quick Poll
• Big Data Deployments in Prod
• Hadoop Distributions
• People use Ecosystems rather than tools
• Architecture was implemented on Cloudera
• Cloud Experience – AWS ?
5. 5Page
Challenges
• Data in Silos
• Acquires Perspectives as data is moved
• Data availability delays
• Legacy Systems handling the Volume , Veracity and Velocity
• Extracting data from legacy systems
• Lack of Self-Service Capabilities
• Knowledge becomes tribal – instead of institutional
• Security / Compliance Requirements
6. 6Page
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
8. 8Page
Self-Service at all Levels
Ingest Organize Enrich Analyze Dashboards
AnalyzeIngest Organize Enrich Insights
9. 9Page
Key Design Tenets
• Separation of Compute and Storage
• Independently scale compute and storage
• Data Democratization and Governance
• Bring your own Compute (BYOC)
• HA / DR
• Open Source Stack
10. 1
0
Page
Separation of Compute and Storage
• Scale storage and compute independently
• Shifts bottleneck from Disk IO to Network
• Centralized Data Storage
• Data Democratization
• No data duplication
• Easier Hardware upgrade paths
• Flexible Architecture
• DR Simplified
11. 1
1
Page
BYOC (Bring Your Own Cluster)
• Each department/application can bring its own Hadoop cluster
• Eliminates the need for very large clusters
• Easier to administer and maintain
• Reduces multi-tenancy issues
• Clusters can be upgraded independently
• Enables usage based cost model
Centralized / Common S3 Storage
Marketing
Cluster
Centralized
Storage
Personalization
Cluster
Main
Cluster
13. 1
3
Page
Architecture – Data Ingestion Layer
• DB Ingestor
• Stream Ingestor
• Kafka and Spark Streaming
• File Ingestor
• FTP / SFTP / Logs
• Ingestion using Service API
14. 1
4
Page
Architecture – Data Processing Layer
• Storage layer carved into logical buckets
• Landing, Raw, Derived and Delivery
• Schema stored with data (no guesswork)
• Platform Jobs
• Converting text to Parquet
• Saving streaming data Parquet
• Derivatives
• Compaction
• Standardization
15. 1
5
Page
Architecture – Data Delivery Layer
• Data Delivery
• SQL - Spark Thrift Server / Impala
• Tableau, SQL IDE, Applications
• Self Service
• Derivatives
• Represented Via SQL on Delivery Layer
• Stored in Derived Storage Layer
• Metadata driven
• Derived Layer Generators
• Long running Spark Job
• Derivative Refresh
17. 1
7
Page
Key Takeaways - Spark Thrift Server
• Spark Thrift Server Support
• Performance Tuning
• Concurrency
• partition strategy
• Cache Tables
• Compression Codec for Parquet
• Snappy vs gzip
18. 1
8
Page
Key Takeaways - Security
• Secure by Design, Secure by Default
• Access to Data on S3
• IAM Roles
• Sentry
• Support for Spark
• Kerberos
• Spark Thrift Server
• Navigator
• Support for Spark
19. 1
9
Page
Key Takeaways - General
• Rapidly Changing Technology
• Feature addition
• Documentation
• Bugs
• Jar hell
• Small files
• Performance Issues
• Compaction
20. 2
0
Page
Key Takeaways - General
• Partition Strategy
• Parquet Files
• Balancing parallelism and throughput
• Table Partitions
• Cluster sizing, optimization and tuning
• Integrating with Corporate infrastructure
• Deployment practices
• Monitoring and Alerting
• Information Security Policies