HiFX designed and implemented a unified data analytics platform called Vision Lens for Malayala Manorama to generate meaningful insights from large amounts of data across their multiple digital properties. The solution involved building a data lake, data pipeline, processing framework, and dashboards to provide real-time and historical analytics. This helped Manorama improve user experiences, drive smarter marketing, and make better business decisions.
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg... (Databricks)
"The modernization of the tobacco industry is resulting in a shift towards a more data-driven approach to trade, operations and the consumer. The need to scale while maintaining margins is paramount, and today’s consumer requires more personalized engagement and value at every interaction to drive sales and revenue.
At Altria, we’re at the forefront of this evolution, leveraging hundreds of terabytes of big data (such as point-of-sale, clickstream, mobile data, and more) and machine learning to improve our ability to make smarter decisions and outpace the competition. This talk recaps our big data journey from a legacy data infrastructure (Teradata), isolated data systems, and a lack of resources that prevented us from moving quickly and scaling, to our current state, where we’ve successfully implemented, architected and on-boarded tools and processes across the data acquisition, storage, preparation, and business intelligence stages with Azure Data Lake, Azure Databricks, Azure Data Factory, API Management, and streaming and hosting technologies, providing a data analytics platform.
We’ll discuss the roadblocks we came across, how we overcame them, and how we employed a unified approach to big data and analytics through the fully managed Azure Databricks platform and the Azure suite of tools which allowed us to streamline workflows, improve operational performance, and ultimately introduce new customer experiences that drive engagement and revenue."
Data Lakes: 8 Enterprise Data Management Requirements (SnapLogic)
2016 is the year of the data lake. As you consider adopting an enterprise data lake strategy to manage more dynamic, poly-structured data, your data integration strategy must also evolve to handle new requirements. Thinking you can simply hire more developers to write code or rely on your legacy rows-and-columns centric tools is a recipe to sink in a data swamp instead of swimming in a data lake.
In this presentation, you'll learn about eight enterprise data management requirements that must be addressed in order to get maximum value from your big data technology investments.
To learn more, visit: https://www.snaplogic.com/big-data
Presented by Jack Norris, SVP Data & Applications, at Gartner Symposium 2016.
Jack presents how companies from TransUnion to Uber use event-driven processing to transform their business with agility, scale, robustness, and efficiency advantages.
More info: https://www.mapr.com/company/press-releases/mapr-present-gartner-symposiumitxpo-and-other-notable-industry-conferences
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics (Informatica)
This presentation is geared toward enterprise architects and senior IT leaders looking to drive more value from their data by learning about cloud data lake management.
As businesses focus on leveraging big data to drive digital transformation, technology leaders are struggling to keep pace with the high volume of data coming in at high speed and rapidly evolving technologies. What's needed is an approach that helps you turn petabytes into profit.
Cloud data lakes and cloud data warehouses have emerged as a popular architectural pattern to support next-generation analytics. Informatica's comprehensive AI-driven cloud data lake management solution natively ingests, streams, integrates, cleanses, governs, protects and processes big data workloads in multi-cloud environments.
Please leave any questions or comments below.
Traditional data storage and analytic tools no longer provide the agility and flexibility required to deliver relevant business insights. That’s why organizations are shifting to a data lake architecture. This approach allows you to store massive amounts of data in a central location so it's readily available to be categorized, processed, analyzed, and consumed by diverse organizational groups. In this session, we’ll assemble a data lake using services such as Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue.
Big Data Management: What's New, What's Different, and What You Need To Know (SnapLogic)
This presentation is from a recorded webinar with 451 Research analyst and thought leader Matt Aslett for a discussion about the growing importance of the right data management best practices and techniques for delivering on the promise of big data in the enterprise. Matt reviews the big data landscape, how the data lake complements and competes with the data warehouse, and key takeaways as you move from big data test and development environments to production. You can watch the webinar here: http://bit.ly/25ShiQu
Democratizing Data Science Using Spark, Hive and Druid (DataWorks Summit)
MZ is re-inventing how the entire world experiences data via our mobile games division MZ Games Studios, our digital marketing division Cognant, and our live data platform division Satori.
The growing need for data science capabilities across the organization requires an architecture that can democratize building these applications and disseminate insights from their outcomes to the wider organization.
Attend this session to learn how we built a platform for data science using Spark, Hive, and Druid specifically for our performance marketing division, Cognant. This platform powers several data science applications, such as fraud detection and bid optimization, at large scale.
We will share lessons learned over the past 3 years of building this platform, walking through some of the actual data science applications built on top of it.
Attendees from ML engineering and data science backgrounds can gain deep insight from our experience of building this platform.
Speakers
Pushkar Priyadarshi, Director of Engineering, Machine Zone Inc
Igor Yurinok, Staff Software Engineer, MZ
My presentation slides from Hadoop Summit, San Jose, June 28, 2016. See live video at http://www.makedatauseful.com/vid-solving-performance-problems-hadoop/ and follow along for context.
Moving analytic workloads into production - specific technical challenges and best practices for engineering SQL in Hadoop solutions. Highlighting the next generation engineering approaches to the secret sauce we have implemented in the Actian VectorH database.
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 (StampedeCon)
This session will detail best practices for architecting, building, operating and managing an Analytics Data Lake platform. Key topics will include:
1) Defining next-generation Data Lake architectures. The de facto standard has been commodity DAS servers with HDFS, but there are now multiple solutions aimed at separating compute and storage, virtualizing or containerizing Hadoop applications, and utilizing Hadoop-compatible or embedded HDFS filesystems. This portion will explore the options available, and the pros and cons of each.
2) Data Ingest. There are many ways to load data into a Data Lake, including standardized Apache tools (Sqoop, Flume, Kafka, Storm, Spark, NiFi), standard file and object protocols (SFTP, NFS, REST, WebHDFS), and proprietary tools (e.g., Zaloni Bedrock, DataTorrent). This section will explore these options in the context of best fit to workflows; it will also look at key gaps and challenges, particularly in the areas of data formats and integration with metadata/cataloging tools.
3) Metadata & Cataloguing. One of the biggest inhibitors of successful Data Lake deployments is Data Governance, particularly in the areas of indexing, cataloguing and metadata management. It is nearly impossible to run analytics on top of a Data Lake and get meaningful & timely results without solving these problems. This portion will explore both emerging open standards (Apache Atlas, HCatalog) and proprietary tools (Cloudera Navigator, Zaloni Bedrock/Mica, Informatica Metadata Manager), and balance the pros, cons and gaps of each.
4) Security & Access Controls. Solving these challenges is key for adoption in regulation-driven industries like Healthcare & Financial Services. There are multiple Apache projects and proprietary tools to address this, but the challenge is making security and access controls consistent across the entire application and infrastructure stack and over the data lifecycle, and being able to audit this in the face of legal challenges. This portion will explore available options and best practices.
5) Provisioning & Workflow Management. The real promise of the Data Lake is to integrate analytics workflows and tools on converged infrastructure, with shared data, and to build "As A Service" architectures oriented towards self-service data exploration and analytics for end users. This is an emerging and immature area, but this session will explore some potential concepts, tools and options to achieve this.
This will be a moderately technical session, with the above topics being illustrated by real world examples. Attendees should have basic familiarity with Hadoop and the associated Apache projects.
Building the Data Lake with Azure Data Factory and Data Lake Analytics (Khalid Salama)
In essence, a data lake is a commodity distributed file system that acts as a repository to hold raw data file extracts of all the enterprise source systems, so that it can serve the data management and analytics needs of the business. A data lake system provides the means to ingest data, perform scalable big data processing, and serve information, in addition to managing, monitoring and securing the IT environment. In these slides, we discuss building data lakes using Azure Data Factory and Data Lake Analytics. We delve into the architecture of the data lake and explore its various components. We also describe the various data ingestion scenarios and considerations. We introduce the Azure Data Lake Store, then we discuss how to build an Azure Data Factory pipeline to ingest data into the data lake. After that, we move into big data processing using Data Lake Analytics, and we delve into U-SQL.
Optimizing industrial operations using the big data ecosystem (DataWorks Summit)
GE Digital is undertaking a journey to optimize the reliability, availability, and efficiency of assets in the industrial sector and converge IT and OT. To do so, GE Digital is building cloud-based products that enable customers to analyze asset data, detect anomalies, and provide recommendations for operating plants efficiently while increasing productivity. In energy sectors such as oil and gas, power, or renewables, a single plant comprises multiple complex assets, such as steam turbines, gas turbines, and compressors, to generate power. Each system contains various sensors to detect the operating conditions of the assets, generating a large volume and variety of data. A highly scalable distributed environment is required to analyze such a large volume of data and provide operating insights in near real time.
In this session I will share the challenges encountered when analyzing these large volumes of data, covering in-stream data analysis, how we standardized the industrial data using data frames, and performance tuning.
In this session we will take a look at Azure Data Lake from an administrator's perspective.
Do you know who has what access where? How much data is in your data lake? What about access to the data lake: is everything running normally?
In this session we will show you what possibilities the portal offers you to keep an eye on the Azure Data Lake. In addition, we will show you further scripts and tools to perform the corresponding tasks.
Dive with us into the depths of your Data Lake.
This is a brief technology introduction to Oracle Stream Analytics, and how to use the platform to develop streaming data pipelines that support a wide variety of industry use cases
In general, data can be broken into two categories – data in motion vs data at rest. Learn the difference between these two types of data and the best infrastructure options to get optimal performance.
A presentation discussing a major shift in enterprise data management: the movement away from the older hub-and-spoke data architecture and towards the newer, more modern Kappa data architecture.
A series of tweets I posted about my 11-hour struggle to make a cup of tea with my WiFi kettle ended up going viral, was picked up by the national and then international press, and led to thousands of retweets, comments and references in the media. In this session we’ll take the data I recorded on this Twitter activity over the period and use Oracle Big Data Graph and Spatial to understand what caused the breakout and the tweet going viral, who were the key influencers and connectors, and how the tweet spread over time and over geography from my original series of posts in Hove, England.
The AWS Big Data services are inherently built to run at scale. In this session, you will learn how to develop an enterprise-scale big data application using AWS services such as Amazon EMR, Amazon Redshift & Redshift Spectrum, Amazon Athena, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight and AWS Glue. This session will also cover different architectural patterns and customer use cases.
In this session we take an in-depth look into the Apache Atlas open metadata and governance function.
Open metadata and governance is a moon-shot type of project to create a set of open APIs, types, and interchange protocols to allow all metadata repositories to share and exchange metadata. From this common base, it adds governance, discovery, and access frameworks to automate the collection, management, and use of metadata across an enterprise. The result is an enterprise catalog of data resources that are transparently assessed, governed, and used in order to deliver maximum value to the enterprise.
Apache Atlas is the reference implementation of the Open Metadata and Governance standards and framework (https://cwiki.apache.org/confluence/display/ATLAS/Open+Metadata+and+Governance). This function will enable an Apache Atlas server to synchronize and query metadata from any open metadata-compliant metadata repository.
In this session we will cover how Open Metadata and Governance works. This includes: (1) the key components in Atlas, (2) the different integration patterns and APIs that vendors can use to integrate their technology into the open metadata ecosystem, and (3) how common metadata use cases such as searching for data sets, managing security (through Atlas/Ranger integration), and automated metadata discovery work in the active ecosystem.
Speaker
Mandy Chessell, Distinguished Engineer, IBM
Using AWS to design and build your data architecture makes it easier than ever to gain insights and uncover new opportunities to scale and grow your business. Join this workshop to learn how you can gain insights at scale with the right big data applications.
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017 (Amazon Web Services)
Join us for this general session where AWS big data experts present an in-depth look at the current state of big data. Learn about the latest big data trends and industry use cases. Hear how other organizations are using the AWS big data platform to innovate and remain competitive. Take a look at some of the most recent AWS big data developments. Learn More: https://aws.amazon.com/government-education/
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da... (Amazon Web Services)
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we first present an end-to-end streaming data solution using Amazon Kinesis Data Streams for data ingestion, Amazon Kinesis Data Analytics for real-time processing, and Amazon Kinesis Data Firehose for persistence. We review in detail how to write SQL queries for operational monitoring using Kinesis Data Analytics.
Learn how PNNL is building their ingestion flow into their Serverless Data Lake leveraging the Kinesis platform. They are migrating existing NiFi processes, where applicable, to various parts of the Kinesis platform: replacing complex NiFi flows that bundle and compress the data with Kinesis Data Firehose, leveraging Kinesis Data Streams for their enrichment and transformation pipelines, and using Kinesis Data Analytics to filter, aggregate, and detect anomalies.
This is part 3 of the series on Data Mesh, looking at the intersection of microservices architecture concepts, data integration/replication technologies and log-based stream integration techniques. This webinar was mostly a demonstration, but several slides used to set up the demo are included here as a PDF for viewers.
This overview presentation discusses big data challenges and provides an overview of the AWS Big Data Platform by covering:
- How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
- Reference architectures for popular use cases, including connected devices (IoT), log streaming, real-time intelligence, and analytics.
- The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), and Redshift.
- The latest relational database engine, Amazon Aurora, a MySQL-compatible, highly available relational database engine which provides up to five times better performance than MySQL at one-tenth the cost of a commercial database.
Created by: Rahul Pathak,
Sr. Manager of Software Development
The challenge of computing big data for evolving digital business processes demands a variety of computation techniques and engines (SQL, OLAP, time-series, graph, document store) working in a unified framework. A simple architecture of data transformations that ensures security, governance, and operational administration provides the critical components for enterprise production environments supporting day-to-day business processes. In this session, you will learn about best practices and the critical components needed to ensure business value from the latest production deployments. Hear how existing customers are using SAP Vora and the value they have achieved so far with this in-memory engine for distributed data processing. The session provides a clear understanding of how SAP Vora and open source components like Apache Hadoop and Apache Spark offer an architecture that supports a wide variety of use cases and industries. You will also receive useful pointers on where to find development resources, test-drive demos, and general documentation.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture (DATAVERSITY)
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Build the data lake, not the data swamp! The tool ecosystem is building up around the data lake, and soon many will have both a robust lake and a data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Using real time big data analytics for competitive advantage (Amazon Web Services)
Many organisations find it challenging to successfully perform real-time data analytics using their own on premise IT infrastructure. Building a system that can adapt and scale rapidly to handle dramatic increases in transaction loads can potentially be quite a costly and time consuming exercise.
Most of the time, infrastructure is under-utilised and it’s near impossible for organisations to forecast the amount of computing power they will need in the future to serve their customers and suppliers.
To overcome these challenges, organisations can instead utilise the cloud to support their real-time data analytics activities. Scalable, agile and secure, cloud-based infrastructure enables organisations to quickly spin up infrastructure to support their data analytics projects exactly when it is needed. Importantly, they can ‘switch off’ infrastructure when it is not.
BluePi Consulting and Amazon Web Services (AWS) are giving you the opportunity to discover how organisations are using real time data analytics to gain new insights from their information to improve the customer experience and drive competitive advantage.
ADV Slides: Building and Growing Organizational Analytics with Data Lakes (DATAVERSITY)
Data lakes are providing immense value to organizations embracing data science.
In this webinar, William will discuss the value of having broad, detailed, and seemingly obscure data available in cloud storage for purposes of expanding Data Science in the organization.
ACDKOCHI19 - Medlife's journey on AWS from ZERO Orders to 6 digits mark (AWS User Group Kochi)
AWS Community Day Kochi 2019 - Technical Session
Medlife's journey on AWS from ZERO Orders to 6 digits mark by Pranesh Vittal, Associate Director - Database & DevOps at Medlife.com
ACDKOCHI19 - Become Thanos of the Lambda Land: Wield all the Infinity Stones (AWS User Group Kochi)
AWS Community Day Kochi 2019 - Technical Session
Become Thanos of the Lambda Land: Wield all the Infinity Stones by Srushith R, Head of Engineering - KonfHub
AWS Community Day Kochi 2019 - Technical Session
Rapid development, CI/CD for Chatbots on AWS by Muthukumar Oman, Senior Architect - AWS Cloud & Big Data Solutions - Agilisium
ACDKOCHI19 - Complete Media Content Management System and Website on Serverless (AWS User Group Kochi)
AWS Community Day Kochi 2019 - Technical Session
Complete Media Content Management System and Website on Serverless by Anoop Mohan, Associate Director Of Technology at Asianet
ACDKOCHI19 - A minimalistic guide to keeping things simple and straightforwar... (AWS User Group Kochi)
AWS Community Day Kochi 2019 - Technical Session
A minimalistic guide to keeping things simple and straightforward on AWS by Jeevan Dongre , AWS Community Hero, Lead: AWS UG BLR
ACDKOCHI19 - Journey from a traditional on-prem Datacenter to AWS: Challenges... (AWS User Group Kochi)
AWS Community Day Kochi 2019 - Sponsor Talks
Journey from a traditional on-prem Datacenter to AWS: Challenges and Opportunities by Thomas Brennekke, Founder & President, Network Redux
ACDKOCHI19 - Enterprise grade security for web and mobile applications on AWS (AWS User Group Kochi)
AWS Community Day Kochi 2019 - Technical Session
Enterprise grade security for web and mobile applications on AWS by Robin Varghese, Chief Architect - TCS
ACDKOCHI19 - Turbocharge Developer productivity with platform build on K8S an... (AWS User Group Kochi)
AWS Community Day Kochi 2019 - Technical Session
Turbocharge Developer productivity with platform build on K8S and AWS services by Laks, Principal Engineer - Intuit
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities, spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Kubernetes & AI - Beauty and the Beast!?! @ KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I have been wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial for or limiting to your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already got working for real.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
1. About HiFX
Established in the year 2001, HiFX is an Amazon Web Services Consulting Partner. We have been designing and migrating workloads in the AWS cloud since 2010, and have been helping organizations become truly data driven by building big data solutions since 2015.
2. Case Study with Malayala Manorama
Malayala Manorama is one of the largest media conglomerates in India. They run manoramaonline.com, the largest news portal for Malayalees around the world, and several other digital media properties.
In 2016, Manorama embarked on a project to develop an in-house analytics pipeline that could unify enormous amounts of raw data from multiple web domains and convert it into meaningful insights. The company currently has 10 domains, such as its matrimonial and real estate sites, with plans to further expand its digital footprint.
HiFX has been Malayala Manorama’s technology partner for more than 18 years and was approached to design this new data analytics pipeline.
3. Digital properties: Manorama Online, Manorama News, The Week, Vanitha, Watchtime India, E-paper/E-magazine, Chuttuvattom, OnManorama, M4Marry, HelloAddress, Quickerala, Qkdoc, Entedeal, Manorama Horizon, Android, iOS, Manorama MAX
4. The Challenges
01. Lack of agility and accessibility for data analysis that would aid the product team in making smart business decisions and improving strategies.
02. Increasing volume and velocity of data. With new digital properties getting added, the collection and storage layers needed to be designed to scale well.
03. Dozens of independently managed collections of data, leading to data silos. Having no single source of truth led to difficulties in identifying what type of data was available, getting access to it, and integrating it.
04. Poorly recorded data. Often, the meaning and granularity of the data was getting lost in processing.
6. “Vision Lens is a unified data platform with a consolidated solution stack to generate meaningful real-time insights and drive revenue.”
• Better product decisions based on behavioral insights
• Add value to our businesses
• Increase CLV
• Deeply understand every user’s journey
• Immediate actions, smart targeting and marketing automation
• Positively impact KPIs
7. Components
01. UNIFIED DATA PIPELINE: Connecting dozens of data streams and repositories to a unified data pipeline, enabling near real-time access to any data source.
02. WELL GOVERNED DATA LAKE: A well governed data lake architected to store raw and enriched data, thereby eliminating storage silos.
03. DATA PROCESSING FRAMEWORK: A data processing framework to support stream and batch workloads to aid analytics and machine learning, along with smart workflow management.
04. BIG DATA STORES FOR OLAP: Well designed big data stores for reporting and exploratory analysis.
05. RECOMMENDATIONS ENGINE: A recommendations and personalization engine powered by machine learning.
06. SMART DASHBOARDS: Dynamic dashboards and smart visualizations that make data tell stories and drive insights.
8. Solution Stack
01. STREAMING ANALYTICS: Watch attention shift in real time. Updates every few seconds to quickly capitalize on attention to every post, campaign and section.
02. BATCH ANALYTICS: A historical view of unique attention metrics to understand what happened in the past and use it to plan for the future.
03. FB IA AND GOOGLE AMP INTEGRATIONS: Integrations with Google Accelerated Mobile Pages and Facebook Instant Articles.
04. VIDEO ANALYTICS: Track key metrics: visits, plays, dropouts and minutes watched.
05. CONTENT PERSONALIZATION: A recommendations and personalization engine powered by machine learning.
06. ADVANCED REPORTING: Dynamic dashboards and smart visualizations that make data tell stories and drive insights.
07. RAW DATA ACCESS: Clean, structured data that teams can analyze directly.
11. Trackers
Data / Event Trackers: Android SDK, iOS SDK, JS SDK, PHP SDK, Java SDK.
Trackers allow us to collect data from any type of digital application, service or device. All trackers adhere to the LENS Tracker Protocol.
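As an illustration only (the LENS Tracker Protocol itself is not specified in these slides), a tracked event could be modeled along the following lines in Scala; every field name here is hypothetical:

    // Hypothetical sketch of a tracker event payload; the actual LENS Tracker
    // Protocol fields are not described in this deck.
    case class LensEvent(
      eventName: String,               // e.g. "page_view", "video_play"
      anonymousId: String,             // device/browser id, later used for user merging
      timestamp: Long,                 // client epoch millis (server corrects clock skew)
      domain: String,                  // which digital property emitted the event
      properties: Map[String, String]  // free-form event attributes
    )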
12. Collectors - Scribe
Scribe collects data from the trackers and writes it to the Kinesis Data Firehose. This allows near real-time processing of data as well as storage in the data lake for further batch analysis. It uses ECS Fargate for containerization.
The data collectors are: 01. Horizontally Scalable; 02. Engineered for High Concurrency; 03. Designed for Low Latency; 04. Written in Go/Java.
Scribe API endpoints:
• Event tracker
• Pixel tracker
• Click tracker
• AMP tracker
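The collectors themselves are written in Go/Java and are not shown in the deck, so the following is only a minimal sketch, in Scala via the AWS SDK v2, of the core operation Scribe performs: putting a tracked event onto a Kinesis Data Firehose delivery stream. The stream name and payload are hypothetical.

    import software.amazon.awssdk.core.SdkBytes
    import software.amazon.awssdk.services.firehose.FirehoseClient
    import software.amazon.awssdk.services.firehose.model.{PutRecordRequest, Record}

    val firehose = FirehoseClient.create()
    val payload  = """{"eventName":"page_view","anonymousId":"abc-123"}"""

    // Firehose buffers records and delivers them in batches to the S3 data lake.
    firehose.putRecord(
      PutRecordRequest.builder()
        .deliveryStreamName("lens-events")   // hypothetical delivery stream
        .record(Record.builder().data(SdkBytes.fromUtf8String(payload + "\n")).build())
        .build()
    )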
13. Accumulo / Data Lake
ACCUMULO: the data consumer component responsible for:
• Reading data from the event firehose (Kinesis Streams)
• Performing rudimentary data quality checks
• Converting data to Avro format with Snappy compression
• Loading it into the Data Lake
DATA LAKE: the data lake supports the following capabilities:
• Capture and store raw data securely, at scale, at a low cost
• Store many types of data in the same repository
• Define the structure of the data at the time it is used
It is designed to retain all data, support all data types, and adapt easily to changes.
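A minimal Spark sketch of the lake-loading step described above: landing events as Snappy-compressed Avro in S3 after a basic quality check. Bucket names and column names are hypothetical, and this assumes the spark-avro module is on the classpath.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("lake-loader").getOrCreate()
    import spark.implicits._

    val events = spark.read.json("s3://example-raw-bucket/events/2019/06/01/")
    events
      .filter($"eventName".isNotNull)      // rudimentary data quality check
      .write
      .format("avro")
      .option("compression", "snappy")     // Snappy-compressed Avro blocks
      .partitionBy("eventDate")
      .save("s3://example-datalake-bucket/events/")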
14. Prism - Processing Engine
Prism, our unified processing and analytics engine, uses Apache Spark and is written in Scala. It can run on EMR 5.27 and as a Databricks job running on AWS spot/on-demand instances.
15. Prism - Processing Engine
Data Cleanser: performs data cleansing, including:
• Normalization
• De-duplication
• Bot exclusion
• Fixes for client clock issues
Data Enricher: performs enrichment activities, including:
• User agent parsing to understand OS/platform
• Referrer parsing to understand channels
• IP-to-location transformation
• Lat+Long-to-location transformation
• Widening event data with user profile information
Data Quality Checks: performs the data quality checks needed to detect, report and omit instrumentation errors.
Data Reconciler: reconciles data that is sacrosanct, like transactions, against the feeds generated by the master DB.
Sessionization / User Merging: sessionizes and merges users based on domain/anonymous ID.
Data Refresher: loads the data into the respective tables in the data warehouse and other reporting data stores.
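A minimal sketch, under assumed column names and continuing from the events DataFrame in the previous sketch, of a few of these cleansing and enrichment steps in Spark: de-duplication, naive bot exclusion, user-agent parsing, and widening events with profile data. parsePlatform is a hypothetical stand-in for a real user-agent parsing library (e.g. uap-scala or yauaa).

    import org.apache.spark.sql.functions._

    val profiles = spark.read.parquet("s3://example-datalake-bucket/profiles/")

    // Hypothetical stand-in for a real user-agent parser.
    val parsePlatform = udf { ua: String =>
      if (ua == null) "unknown" else if (ua.contains("Android")) "android" else "other"
    }

    val enriched = events
      .dropDuplicates("eventId")                                       // de-duplication
      .filter(!lower(coalesce($"userAgent", lit(""))).contains("bot")) // naive bot exclusion
      .withColumn("platform", parsePlatform($"userAgent"))             // OS/platform enrichment
      .join(broadcast(profiles), Seq("anonymousId"), "left")           // widen with profile info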
16. Prism - Real-time Analytics
• Uses Structured Streaming to stream live events into Elasticsearch.
• The stack can be run on both EMR and Databricks.
• Runs on 50 r4.xlarge instances, scaled to 100 instances during election time.
• Configurations:
spark.executor.cores=4
spark.executor.memory=25g
spark.executor.instances=50
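A minimal Structured Streaming sketch of this real-time path: reading from a Kinesis stream and sinking to Elasticsearch. It assumes a "kinesis" source connector is available (as on Databricks, or via the spark-sql-kinesis connector) along with the ES-Hadoop elasticsearch-spark connector; stream, host, checkpoint, and index names are hypothetical.

    // Read live events from the Kinesis stream.
    val live = spark.readStream
      .format("kinesis")
      .option("streamName", "lens-events")          // hypothetical stream name
      .option("region", "ap-south-1")
      .load()
      .selectExpr("CAST(data AS STRING) AS json")   // raw record payload

    // Sink to Elasticsearch via the ES-Hadoop structured streaming sink.
    live.writeStream
      .format("es")
      .option("es.nodes", "example-es-host")
      .option("checkpointLocation", "s3://example-checkpoints/realtime/")
      .start("lens-realtime/event")                 // index/type to write into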
17. Prism - Batch Analytics
Spark on EMR/Databricks
• A scheduled job kicks off every day to process all the events for the day and writes the cleansed raw/aggregated data to Redshift (the primary data store).
• It also writes the data in Parquet format, to run Presto or Databricks Delta Lake on top if needed.
• Runs on 20 r4.2xlarge instances.
• Configurations:
spark.executor.cores=3
spark.executor.memory=20g
spark.executor.instances=39
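A minimal sketch of that daily batch flow, continuing from the enrichment sketch above: write a day's cleansed events as Parquet for Presto/Delta, then load aggregates into Redshift. The connector shown is the community spark-redshift data source; the JDBC URL, table, and bucket names are hypothetical.

    import org.apache.spark.sql.functions.count

    val daily = enriched.filter($"eventDate" === "2019-06-01")

    // Parquet copy for ad-hoc engines (Presto / Databricks Delta Lake).
    daily.write.mode("overwrite")
      .partitionBy("eventDate")
      .parquet("s3://example-datalake-bucket/curated/events/")

    // Aggregates into Redshift, the primary reporting store.
    daily.groupBy("domain", "section")
      .agg(count("*").as("page_views"))
      .write
      .format("io.github.spark_redshift_community.spark.redshift")
      .option("url", "jdbc:redshift://example-cluster:5439/lens?user=example&password=example")
      .option("dbtable", "daily_page_views")
      .option("tempdir", "s3://example-temp-bucket/redshift/")
      .option("forward_spark_s3_credentials", "true")
      .mode("append")
      .save()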
18. Data Stores
01. DATA WAREHOUSE (AMAZON REDSHIFT): the primary data store.
• Supports batch workloads
• Supports up to 50 concurrent queries
• pgpool deployed as a cache layer
• WLM and concurrency scaling enabled
• Elastic Resize
• Redshift Spectrum to query archived data in S3
02. REAL-TIME REPORTING STORE (ELASTICSEARCH): powers the content analytics real-time dashboard.
• Fluidic dashboard with granular filters
• Data visualization using Kibana
03. RECOMMENDATION RESULTS (DYNAMODB): features like horizontal scalability, low operational overhead and predictable performance make DynamoDB a good choice for storing recommendation results.
19. Orchestration
Apache Airflow
• Workflow Management: used to programmatically author, schedule and monitor workflows.
• Rich UI: a rich UI makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
20. Data Retention Strategy
• Find a balance between what’s optimal for your clients’ business needs and operational cost effectiveness.
• Ensure the data retention policies align with regulatory restrictions (GDPR).
• Define proper lifecycle policies at the different stages.
• An S3-IA/Glacier lifecycle policy is defined for the data at rest in the data lake, and a scheduled purging policy is defined for the primary data store (Redshift).
• We keep a quarter’s worth of data in the primary data store (Redshift); older data is archived to S3, and Redshift Spectrum is used for detailed analysis of that older data.
• For YoY and QoQ comparisons, we pre-calculate the numbers as part of the quarterly process and store the aggregated results in the data store.
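A minimal sketch, via the AWS SDK v2 from Scala, of the kind of S3 lifecycle rule described above: transitioning aged data-lake objects to Standard-IA and then Glacier. The bucket name, prefix, and day thresholds are hypothetical.

    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model._

    val s3 = S3Client.create()
    val rule = LifecycleRule.builder()
      .id("archive-raw-events")
      .filter(LifecycleRuleFilter.builder().prefix("events/").build())
      .status(ExpirationStatus.ENABLED)
      .transitions(
        Transition.builder().days(90).storageClass(TransitionStorageClass.STANDARD_IA).build(),
        Transition.builder().days(365).storageClass(TransitionStorageClass.GLACIER).build()
      )
      .build()

    // Apply the lifecycle configuration to the data-lake bucket.
    s3.putBucketLifecycleConfiguration(
      PutBucketLifecycleConfigurationRequest.builder()
        .bucket("example-datalake-bucket")
        .lifecycleConfiguration(BucketLifecycleConfiguration.builder().rules(rule).build())
        .build()
    )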
21. Dashboard - KPIs / Different Angles
Domain-specific KPIs (key metrics in the Content Dashboard): Page Views, New and Returning Visitors, Engaged Time, Social Shares and Referrals, Bounce Rate, Video Play Rate, Google AMP, Facebook IA.
Different Angles (explore the content data from these angles): Titles, Authors, Sections, Tags, Referrers, Campaigns.
22. Scalability / Performance
01. Collect, storage and process layers designed to autoscale.
02. Batch analytics takes an average of 30-40 minutes to process and refresh data for the entire day across all reporting dashboards.
03. Turnaround latency at the data collector: 27 ms at the 75th percentile and 156 ms at the 95th percentile.
04. Currently handles about 150 GB of data per day, with an average of 300 million events processed per day.
05. Horizontally scalable data collectors, data consumers, data processors and data reporting stores.
06. The real-time streaming stack currently processes 500K events in less than 10 seconds.
23. Best Practices in Spark
• Use Dataset, DataFrames and Spark SQL instead of RDDs to get the benefits of the Catalyst optimizer.
• Choose the best data format and compression. Apache Parquet gives the fastest read performance with Spark thanks to its vectorized Parquet reader; run Presto or Databricks Delta Lake on top if needed. Avro offers rich schema support and more efficient writes than Parquet. Choose either Snappy or LZO compression, as they strike a balance between splittability and block compression.
• Use the Spark Web UI to explore your jobs, storage, and SQL query plans to optimize your Spark execution. Look at the Spark event timeline to see the amount of time spent in each stage/task. Check the shuffles between stages and the amount of data shuffled (use the spark.sql.shuffle.partitions option if needed). Check the join algorithms being used: a broadcast join should be used when one table is small, and a sort-merge join should be used for large tables; you can use bucketing to pre-sort and group tables, which avoids shuffling in the sort-merge join (a sketch follows below).
• Enable Dynamic Partition Pruning, flattenScalarSubqueriesWithAggregates, Bloom Filter Join and Optimized Join Reorder.
• Use the s3 protocol instead of s3a/s3n to refer to the data so that it goes through the optimized path.
• Use EMRFS consistent view only if it is required.
• Find an optimal configuration for the number of executors, the memory settings for each executor, and the number of cores for the Spark job.
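A minimal sketch of the join guidance above: forcing a broadcast join against a small dimension table and tuning shuffle parallelism. Table and column names are hypothetical.

    import org.apache.spark.sql.functions.broadcast

    // Size shuffle parallelism to the data volume and cluster cores.
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    // Small dimension table: broadcasting it avoids shuffling the large side.
    val sections = spark.read.parquet("s3://example-datalake-bucket/dim/sections/")
    val joined   = events.join(broadcast(sections), Seq("sectionId"))

    joined.explain() // verify BroadcastHashJoin appears in the physical plan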
24. Outcomes
• Ability to run targeted mobile push and email campaigns.
• Consistent KPI measurement: the client has a consistent framework across properties to measure KPIs.
• A single source of truth in place of dozens of independently managed collections of data, making it easier to identify what type of data is available, get access to it, and integrate it.
• Better user experience: recommendations running off the data in the Data Lake add value to the digital properties we manage.
• Better business agility and product decisions based on behavioural insights: the journey from data to decisions is made swifter.