The document provides an overview of data governance on AWS. It discusses key aspects of data governance including availability, usability, consistency, integrity and security. It covers considerations for data access, regulations, and architecture implications. Specific AWS services for data storage, processing and security are outlined, such as S3, VPC, IAM, encryption options. The document emphasizes account segregation, identity-based access policies, and designing networks and data flows with big data applications in mind.
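To make the identity-based access idea concrete, here is a minimal sketch of a least-privilege IAM policy document for S3, built in Python. The bucket and prefix names are hypothetical, and this only constructs the policy JSON; attaching it to a role or user would be done separately (e.g. via the AWS console or an IaC tool).

```python
import json

def s3_readonly_policy(bucket: str, prefix: str) -> dict:
    """Build an identity-based IAM policy granting read-only access
    to a single S3 prefix (least-privilege data access)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucketPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                # Limit listing to the one prefix this identity may read
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }

policy = s3_readonly_policy("example-data-lake", "curated/sales")
print(json.dumps(policy, indent=2))
```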
Big Data in 200 km/h | AWS Big Data Demystified #1.3 | Omid Vahdaty
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, I was not so impressed with this field at first. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it covers only a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I use AWS infrastructure to answer the basic questions of anyone starting out in the big data world.
- How do we transform data (TXT, CSV, TSV, JSON) into Parquet or ORC?
- Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
- How do we handle streaming?
- How do we manage costs?
- Performance tips?
- Security tips?
- Cloud best practices?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
Amazon AWS Big Data Demystified | Introduction to streaming and messaging flu... | Omid Vahdaty
Amazon AWS Big Data Demystified meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
Introduction to streaming and messaging: Flume, Kafka, SQS, Kinesis
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned | Omid Vahdaty
A while ago I entered the challenging world of Big Data. As an engineer, I was not so impressed with this field at first. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it covers only a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I use AWS, GCP and data center infrastructure to answer the basic questions of anyone starting out in the big data world.
- How do we transform data (TXT, CSV, TSV, JSON) into Parquet, ORC, or Avro?
- Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? GCS? BigQuery? Dataflow? Datalab? TensorFlow?
- How do we handle streaming?
- How do we manage costs?
- Performance tips?
- Security tips?
- Cloud best practices?
In this meetup we present lecturers working with several cloud vendors, various big data platforms such as Hadoop and data warehouses, and startups building big data products. Basically, if it is related to big data, this is THE meetup.
Some of our online materials (mixed content from several cloud vendors):
Website:
https://big-data-demystified.ninja (under construction)
Meetups:
https://www.meetup.com/Big-Data-Demystified
https://www.meetup.com/AWS-Big-Data-Demystified/
YouTube channels:
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Audience:
Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
A comprehensive introduction to the Big Data world in the AWS cloud: Hadoop, streaming, batch, Kinesis, DynamoDB, HBase, EMR, Athena, Hive, Spark, Pig, Impala, Oozie, Data Pipeline, security, cost, and best practices.
AWS Big Data Demystified #1: Big data architecture lessons learned | Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of the big data technologies that were selected or disregarded at our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
The facebook group : https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale | ScyllaDB
Zeotap’s Connect product addresses the challenges of identity resolution and linking for AdTech and MarTech. Zeotap manages roughly 20 billion IDs, and growing. In their presentation, Zeotap engineers will delve into data access patterns and processing and storage requirements to make the case for a graph-based store. They will share the results of PoCs on technologies such as Dgraph, OrientDB, Aerospike and Scylla, present the reasoning for selecting JanusGraph backed by Scylla, and take a deep dive into their data model architecture from the point of ingestion. Learn what is required in production setup, configuration and performance tuning to manage data at this scale.
At Signal we've been running Apache Cassandra in production since late 2011. We use a multi-region Cassandra deployment to make our data available globally to our customers. While Cassandra does much of the heavy lifting for us, we've run into interesting challenges during periods of rapid growth. In this presentation we'll focus on one of those scenarios, including our before and after data model, methodology and tools we used to recover and lessons learned along the way.
Shift: Real World Migration from MongoDB to Cassandra | DataStax
Presentation on SHIFT's migration from MongoDB to Cassandra. Topics will include reasons behind choosing to move to Cassandra, zero downtime migration strategy, data modeling patterns, and the benefits of using CQL3.
Cassandra vs. ScyllaDB: Evolutionary Differences | ScyllaDB
Apache Cassandra and ScyllaDB are distributed databases capable of processing massive globally-distributed workloads. Both use the same CQL data query language. In this webinar you will learn:
- How are they architecturally similar and how are they different?
- What's the difference between them in performance and features?
- How do their software lifecycles and release cadences contrast?
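Because both databases speak the same CQL, the same schema and statements run unchanged against either. A small sketch, with a hypothetical time-series table; no live cluster is needed here, but against a real one these strings would be passed to a driver (e.g. the Python cassandra-driver's `session.execute()`), which works with both Cassandra and Scylla:

```python
# CQL DDL accepted verbatim by both Apache Cassandra and ScyllaDB.
# The keyspace/table/columns here are illustrative, not from the webinar.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
    sensor_id uuid,
    bucket    date,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
"""

def insert_cql(keyspace: str, table: str, columns: list[str]) -> str:
    """Build a parameterized CQL INSERT, identical for Cassandra and Scylla."""
    cols = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {keyspace}.{table} ({cols}) VALUES ({placeholders})"

print(insert_cql("metrics", "sensor_readings",
                 ["sensor_id", "bucket", "ts", "value"]))
```

The architectural differences the webinar covers (thread-per-core in Scylla vs. the JVM model in Cassandra) sit below this shared query layer, which is what makes migrations between them comparatively low-friction.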
DataStax and Esri: Geotemporal IoT Search and Analytics | DataStax Academy
Internet of Things (IoT) data frequently has a location and time component. Getting value out of this "geotemporal" data can be tricky. We'll explore when and how to leverage Cassandra, DSE Search and DSE Analytics to surface meaningful information from your geotemporal data.
Cassandra as event sourced journal for big data analytics | Anirvan Chakraborty
Avoiding destructive updates and keeping history of data using event sourcing approaches has large advantages for data analytics. This talk describes how Cassandra can be used as event journal as part of CQRS/Lambda Architecture using event sourcing and further used for data mining and machine learning purposes in a big data pipeline.
All the principles are demonstrated on an application called Muvr that we built. It uses data from wearable devices, such as the accelerometer in a watch or a heart-rate monitor, to classify a user's exercises in near real time. It uses mobile devices and a clustered Akka actor framework to distribute computation, and then stores events as immutable facts in a journal backed by Cassandra. The data is then read by Apache Spark and used for more expensive analytics and machine learning tasks, such as suggesting improvements to a user's exercise routine or improving machine learning models for better real-time exercise classification that can be used immediately. The talk mentions some of the internals of Spark when working with Cassandra and focuses on its machine learning capabilities enabled by Cassandra. A lot of the analytics are done for each user individually, so the whole pipeline must handle a potentially large number of concurrent users and a lot of raw data; we therefore need to ensure attributes such as responsiveness, elasticity and resilience.
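The event-sourcing idea behind the talk can be shown with a minimal in-memory sketch: state is never updated in place; immutable events are appended to a journal and the current state is derived by folding over history. (In the real system the journal lives in Cassandra and Spark reads it; the exercise events below are invented for illustration.)

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """An immutable fact, e.g. one classified set of exercises."""
    user_id: str
    exercise: str
    reps: int

class EventJournal:
    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        # Append-only: no destructive updates, full history retained.
        self._events.append(event)

    def replay(self, user_id: str) -> dict:
        """Fold over the full history to rebuild a user's current state."""
        totals: dict[str, int] = {}
        for e in self._events:
            if e.user_id == user_id:
                totals[e.exercise] = totals.get(e.exercise, 0) + e.reps
        return totals

journal = EventJournal()
journal.append(Event("u1", "squat", 10))
journal.append(Event("u1", "squat", 12))
journal.append(Event("u1", "pushup", 20))
print(journal.replay("u1"))  # {'squat': 22, 'pushup': 20}
```

Because the journal is append-only, the same history can feed both the fast read path (replay) and batch analytics, which is exactly the CQRS/Lambda split the talk describes.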
How to Build a Scylla Database Cluster that Fits Your Needs | ScyllaDB
Sizing a database cluster makes or breaks your application. Too small, and you cannot sustain spikes in usage or recover from a node loss or an operational slowdown. Too big, and your cluster will cost more and waste valuable human resources.
Since different workloads have different requirements, successful sizing of your application should be optimized for both throughput and latency. In many cases, however, the requirements for each contradict the other.
In this webinar, we explain how to reconcile these contradicting forces and build a sustainable cluster that meets both performance and resiliency requirements.
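A back-of-the-envelope sketch of the trade-off described above: size the cluster for throughput and for storage (with replication) separately, then take the larger. The per-node figures here are illustrative assumptions, not vendor numbers; real sizing must also account for latency targets, compaction, and repair overhead.

```python
import math

def nodes_needed(peak_ops_per_sec: float,
                 dataset_gb: float,
                 replication_factor: int = 3,
                 node_ops_per_sec: float = 50_000,   # assumed per-node capacity
                 node_storage_gb: float = 2_000,     # assumed usable disk per node
                 headroom: float = 0.7) -> int:
    """Return a node count sized for both throughput and storage,
    leaving headroom for spikes and node-loss recovery."""
    by_throughput = peak_ops_per_sec / (node_ops_per_sec * headroom)
    by_storage = dataset_gb * replication_factor / (node_storage_gb * headroom)
    # Need at least replication_factor nodes so each replica gets its own node.
    return max(math.ceil(by_throughput), math.ceil(by_storage), replication_factor)

# Example: 400k ops/s peak over a 10 TB dataset is storage-bound here.
print(nodes_needed(peak_ops_per_sec=400_000, dataset_gb=10_000))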
Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. With Cloud Spanner you enjoy all the traditional benefits of a relational database: ACID transactions, relational schemas (and schema changes without downtime), SQL queries, high performance, and high availability. But unlike any other relational database service, Cloud Spanner scales horizontally, to hundreds or thousands of servers, so it can handle the highest of transactional workloads.
Improving Organizational Knowledge with Natural Language Processing Enriched ... | DataWorks Summit
The information age has allowed everyone to tap into the exponential production of data. Unfortunately, much actionable insight is the result of unexpected or anomalous behavior that can only be recognized through experience. A collection of NLP microservices was crafted to complement an organization’s existing technology infrastructure in order to translate and bring additional meaning to an organization’s already existing and real time collection of unstructured text.
In this session, and in collaboration with Partners & Co., a Chicago-based real estate firm, we will demonstrate how we can leverage an organization’s collective knowledge and turn unstructured text that is generated from across various communication mediums into real time actionable insight. We will demonstrate how we can use a combination of open source tools such as Apache NiFi, Kafka, OpenNLP, and Superset to build a full streaming NLP pipeline to consume unstructured text, detect the language and sentences within the text, deconstruct the grammatical makeup, and derive meaning of the entities identified within the text.
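Two stages of such a pipeline, sentence detection and entity spotting, can be illustrated with a toy, dependency-free sketch. In the session these are OpenNLP models fed by NiFi/Kafka; the regex stand-ins and sample text below are purely illustrative.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive sentence detection: split on ., ! or ? followed by
    # whitespace and a capital letter (OpenNLP uses a trained model).
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def spot_entities(sentence: str) -> list[str]:
    # Naive NER stand-in: capitalized words, skipping the sentence start.
    words = sentence.split()
    return [w.strip(".,") for w in words[1:] if w[:1].isupper()]

doc = "Acme signed a lease in Chicago. The broker notified Partners today."
sents = split_sentences(doc)
print(sents)
print([spot_entities(s) for s in sents])
```

In the streaming version, each Kafka message would pass through these stages as NiFi processors, with the extracted entities landing in a store that Superset can chart in near real time.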
CyberAgent is a leading Internet company in Japan focused on smartphone social communities and a game platform known as Ameba, which has 40M users. In this presentation, we will introduce how we use HBase for storing social graph data and as a basis for ad systems, user monitoring, log analysis, and recommendation systems.
Disney+ Hotstar: Scaling NoSQL for Millions of Video On-Demand Users | ScyllaDB
Disney+ Hotstar is the fastest growing branch of Disney+. Join Disney+ Hotstar Architect Vamsi Subhash and senior data engineer Balakrishnan Kaliyamoorthy to learn…
- How Disney+ Hotstar architected their systems to handle massive data loads
- Why they chose to replace both Redis and Elasticsearch
- Their requirements for massively scalable data infrastructure and evolving data models
- How they migrated their data to Scylla Cloud, ScyllaDB’s fully managed NoSQL database-as-a-service, without suffering downtime
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons | DataStax
We'll be covering some aspects of our architecture, highlighting differences between MongoDB and Cassandra. We'll go in depth to explain why Cassandra is a better choice for our general purpose Application Platform (SHIFT) as well as our Media Buying Analytics tool (the SHIFT Media Manager). We'll be going over common design patterns people might be familiar with coming from a background with MongoDB and highlight how Cassandra would be used as a better alternative. We'll also touch more on cqlengine which is nearing feature completeness as the Cassandra object mapper for Python.
Dyn delivers exceptional Internet Performance. Enabling high quality services requires data centers around the globe. In order to manage services, customers need timely insight collected from all over the world. Dyn uses DataStax Enterprise (DSE) to deploy complex clusters across multiple datacenters to enable sub 50 ms query responses for hundreds of billions of data points. From granular DNS traffic data, to aggregated counts for a variety of report dimensions, DSE at Dyn has been up since 2013 and has shined through upgrades, data center migrations, DDoS attacks and hardware failures. In this webinar, Principal Engineers Tim Chadwick and Rick Bross cover the requirements which led them to choose DSE as their go-to Big Data solution, the path which led to SPARK, and the lessons that we’ve learned in the process.
What Kiwi.com Has Learned Running ScyllaDB and Go | ScyllaDB
Kiwi.com, a global travel booking site, uses Scylla as its search engine storage backend. Since the last Scylla Summit, Kiwi.com has migrated from Cassandra to Scylla. Find out how our distributed database topology influences the development of all our applications. Also learn how we rewrote our core services, originally written in Python, in Go, and how we obtained performance improvements with the gocql driver.
How vulnerable are your systems after the first line of defense? Do attackers get a stronger foothold after each compromise? How valuable is the data your systems can leak?
“Death Star” security describes a system that relies entirely on an outermost security layer and fails catastrophically when breached. As services multiply, they shouldn’t all run in a single, trusted virtual private cloud. Sharing secrets doesn’t scale either, as systems multiply and partners integrate with your product and users.
David Strauss explores security methods strong enough to cross the public Internet, flexible enough to allow new services without altering existing systems, and robust enough to avoid single points of failure. David covers the basics of public key infrastructure (PKI), explaining how PKI uniquely supports security and high availability, and demonstrates how to deploy mutual authentication and encryption across a heterogeneous infrastructure, use capability-based security, and use federated identity to provide a uniform frontend experience while still avoiding monolithic backends. David also explores JSON Web Tokens as a solution to session woes, distributing user data and trust without sharing backend persistence.
A good written summary of the key talking points: https://www.infoq.com/news/2016/04/oreilysacon-day-one
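The JSON Web Token idea from the talk, signing user claims so any frontend can verify them without shared backend persistence, fits in a short stdlib-only sketch. This is an illustrative HS256 implementation; in production you would use a vetted JWT library plus expiry checks and key rotation:

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    # JWT uses base64url without padding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims: dict, secret: bytes) -> str:
    """Produce a compact HS256 JWT: header.payload.signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_jwt(token: str, secret: bytes):
    """Return the claims if the signature checks out, else None."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # tampered token or wrong key
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

token = sign_jwt({"sub": "user42", "role": "editor"}, b"demo-secret")
print(verify_jwt(token, b"demo-secret"))
```

Because verification needs only the shared key (or, with RS256, just the public key), any service in the heterogeneous infrastructure can check a token locally, which is how JWTs avoid the monolithic session store.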
by Roy Feintuch, Dome9
Join us for four days of security and compliance sessions and hands-on labs led by our AWS security pros during AWS Security Week at the San Francisco Loft. Join us for all four days, or pick just the days that are most relevant to you. We'll open on Monday with Security 101 day, followed by sessions on Identity and Access Management on Tuesday. Our popular Threat Detection and Remediation day on Wednesday will feature an updated GuardDuty lab, and we'll end on Thursday with Incident Response sessions, labs, and a talk by Netflix on their new open source IR tool. The week will also feature Dome9 as a sponsor; you can hear them speak and take part in a hands-on workshop during Monday's Security 101 day.
Security and privacy of cloud data: what you need to know (Interop) | Druva
Is your company thinking about the cloud or already in there but learning fast about the many challenges of the security and privacy of cloud data?
Learn more about the landscape of data in the cloud, and the obstacles that every company should consider when it comes to protecting their data.
Next-Generation Security Operations with AWS | AWS Public Sector Summit 2016 | Amazon Web Services
With the upswell of cloud adoption, many traditional infrastructure paradigms are shifting. Security is no different. The cloud service provider industry is discovering new ways to tackle security, including automation, bottomless logging, scalable analysis clusters, and pluggable security tools. This session presents a case study in extending a traditional infrastructure operation into AWS. We provide a practical look into the technical challenges and benefits of operating in this new paradigm, explore incident response automation (Alexa integration), and provide various examples of shifting an on-premises security operation to a scalable, hybrid model. Through lessons learned and analysis, we show why your data is safer in the cloud than in that rack you can touch in your data center.
Core strategies to develop defense in depth in AWSShane Peden
Information security guidance and strategies for securing cloud infrastructure in Amazon Web Services, presented by risk3sixty LLC and Afonza. Atlanta based cyber risk management.
Securing your database servers from external attacksAlkin Tezuysal
A critical piece of your infrastructure is the database tier, yet people don't pay enough attention to it judging by how many are bitten via poorly chosen defaults, or just a lack understanding of running a secure database tier. In this talk, I'll focus on MySQL/MariaDB, PostgreSQL and MongoDB, and cover external authentication, auditing, encryption, SSL, firewalls, replication, and more gems from over a decade of consulting in this space from Percona's 4,000+ customers.
O mundo da nuvem trouxe vários facilitadores. Porém, com ele, também vieram novos desafios e superfícies de ataque que precisam ser estudadas, entendidas e monitoradas.
Atualmente temos por volta de 300 serviços na AWS que nos trazem milhares de ações que podem ser utilizadas para o bem ou mal.
Foi sobre esses desafios que Rodrigo Montoro, nosso Head of Threat and Detection Research, palestrou em seu painel "Construindo um plano de resposta a incidentes para ambientes AWS", na 8ª edição do Mind the Sec.
Cloud computing transforms the way we can store, process and share our data. New applications and workloads are growing rapidly, which brings every day more sensitive data into the conversation about risk and what constitutes natural targets for bad actors. This presentation reflects on current best practices to address the most significant security concerns for sensitive data in the cloud, and offers participants a list of steps to achieve enterprise-grade safety with MongoDB deployments among the expanding service provider options.
In today's rapidly evolving tech landscape, data privacy is one of the most critical issues that businesses face. I'd like to share with you my insights on the principles and best practices for ensuring the resilience and security of your workload.
Drawing on a real-life project from the HR industry, I will demonstrate how we tackled the challenges of data protection, self-healing, business continuity, security, and transparency of data processing. Through our solutions, we were able to create a secure AWS cloud infrastructure that not only met strict compliance rules but also exceeded our client's expectations.
Designing for Privacy in Amazon Web ServicesKrzysztofKkol1
Data privacy is one of the most critical issues that businesses face. This presentation shares insights on the principles and best practices for ensuring the resilience and security of your workload.
Drawing on a real-life project from the HR industry, the various challenges will be demonstrated: data protection, self-healing, business continuity, security, and transparency of data processing. This systematized approach allowed to create a secure AWS cloud infrastructure that not only met strict compliance rules but also exceeded the client's expectations.
How to design the architecture and processes for the application which needs to process protected and personal data? This presentation is based on a real-life project, implemented in Xebia. Presented on AWS Community Day NL in Utrecht, NL. 20.09.2023.
Cloud Security and some preferred practicesMichael Pearce
Cloud Security and some preferred practices. Security isn't easy, but here is why it matters, the difference between security and compliance and what we can do to implement it and mitigate some of the risks.
Michael Pearce, DevOps Engineer @ Peak AI.
With Enterprise data growing rapidly year over year, traditional analytics approaches have proven to be expensive and unyielding. The result is that a growing proportion of our data is unused “dark data”. How can we create the basis for a data driven organization? Enter the "perfect storm" of cloud data analytics tools and approaches.
Cloud Security for Regulated Firms - Securing my cloud and proving itHentsū
As a regulated cloud user, security and compliance are two of your primary concerns, a workshop on how to keep secure and demonstrate your compliance to key stakeholders.
Specifically, what can be done to secure cloud resources and show compliance for auditors, investors, DDQs, SSAE16, covering:
- Strategies for securing data in transit and at rest
- Federating with your internal directory for role based access to your cloud
- Capturing and processing audit logs for security event notifications
- Fun with Infrastructure as Code – detecting and reverting misconfigurations and manual changes
2. Agenda
● Disclaimer
○ I am not a network and security expert
○ This is not a security and network lecture
● Agenda
○ All the possible considerations for my data (access, regulation, availability etc.) == Data Governance
○ What are the architecture implications?
● My Advice:
○ Stick to the high level overview
○ Remember the topic, not the details
○ Ask me questions during the lecture!
4. Big Data Generic Architecture | Summary
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
5. Before I forget: for new AWS accounts….
● Disable unused regions via IAM
● Set a limit for the instance types you are using, e.g. 50 instances
● Set the limit for instance types you are not using to 0!
● Remove the access key and secret key for the root account
● Don't use the root account
● Use MFA
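The "disable unused regions" item can be done with a deny policy on the `aws:RequestedRegion` global condition key. A minimal sketch in Python, building the policy JSON; the region list and `Sid` are examples, not the deck's actual values:

```python
import json

# Sketch of an IAM/SCP-style policy that denies requests outside the
# regions you actually use ("disable unused regions via IAM").
# The region list and Sid are placeholders -- adjust to your account.
ALLOWED_REGIONS = ["eu-west-1", "us-east-1"]

deny_unused_regions = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnusedRegions",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": ALLOWED_REGIONS}
            },
        }
    ],
}

print(json.dumps(deny_unused_regions, indent=2))
```

Attach the resulting document to an IAM group (or, in an AWS Organization, as an SCP) so every principal in the account inherits the restriction.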
6. Data Governance…
Data Security Level
Data governance is the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data. The key focus areas of data governance include availability, usability, consistency, data integrity and data security, and include establishing processes to ensure effective data management throughout the enterprise, such as accountability for the adverse effects of poor data quality, and ensuring that the data an enterprise has can be used by the entire organization.
9. AWS Security options in general [sorry, not covering everything…]
● IAM: identity management
○ Users, policies, roles, groups, least privileges, MFA
○ Key Management Service (KMS)
■ Server side
■ Client side
○ Disable unused regions and unused instance families
○ Limit resources
○ Account segregation
○ Identity based policies!
● CloudTrail: enables governance, compliance, auditing, and risk auditing
● S3: resource management
○ Write only, read only, no delete
○ Versioning, encryption, life cycle policy
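The "least privileges" and "identity based policies" bullets above can be made concrete with a small example: an identity-based policy that grants read-only access to a single bucket and nothing else. The bucket name and `Sid`s are placeholders:

```python
import json

# Hypothetical identity-based policy illustrating least privileges:
# read-only access to one example bucket, no other permissions.
read_only_bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListExampleBucket",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            # ListBucket applies to the bucket itself, not its objects
            "Resource": "arn:aws:s3:::example-data-bucket",
        },
        {
            "Sid": "ReadExampleObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            # GetObject applies to the objects inside the bucket
            "Resource": "arn:aws:s3:::example-data-bucket/*",
        },
    ],
}

print(json.dumps(read_only_bucket_policy, indent=2))
```

Anything not explicitly allowed is implicitly denied, which is what makes this a least-privileges policy.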
10. AWS Network options in general
● VPC
○ Create a non-default VPC
○ Private network + bastion host + ALB
○ Public subnet vs private subnet
○ VPC endpoints
○ NACL
○ Security groups (SG)
○ Internet gateway (IG)
○ VPC peering
○ Site-to-site VPN - it is per VPC… choose carefully
● Direct Connect vs HTTPS
● CloudFront with geolocation protection
● https://amazon-aws-big-data-demystified.ninja/2018/06/27/aws-enterprise-grade-networking-security-what-are-your-options-to-protect-your-bigdata/
17. Data Governance…
Storage Level
18. S3 coolest options on AWS
● All of your data is replicated across multiple AZs!
● Life cycle policy
● Versioning
● Cross region replication
● Undeletable bucket - a no-delete policy
● MFA on delete actions - per action or with a time interval of 1 min
● Deny unencrypted data writes
● Analytics
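Two of the bullets above - the undeletable bucket and denying unencrypted writes - are implemented as a resource-based (bucket) policy. A minimal sketch; the bucket name is a placeholder:

```python
import json

# Sketch of a bucket policy combining two safeguards from the list above:
# deny object deletion, and deny writes that do not request server-side
# encryption. "example-undeletable-bucket" is a placeholder name.
bucket = "arn:aws:s3:::example-undeletable-bucket"

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyDelete",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": f"{bucket}/*",
        },
        {
            "Sid": "DenyUnencryptedWrites",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"{bucket}/*",
            # Deny PutObject when no server-side-encryption header is present
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

Because an explicit Deny beats any Allow, these statements hold even for users whose identity-based policies grant full S3 access.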
19. General Security Concepts | Good to know!
● Protecting data while:
○ in transit (as it travels to and from Amazon S3), 2 ways:
■ by using SSL/TLS
■ by using client-side encryption
○ at rest (while it is stored on disks in Amazon S3 data centers), 2 ways:
■ server side encryption (SSE)
■ client-side encryption
○ in use:
■ hashing… (vulnerable to dictionary attacks)
■ hashing with a key
■ any encryption
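The difference between plain hashing and hashing with a key can be sketched with Python's standard library. A bare SHA-256 of a low-entropy value (an email, a phone number) can be reversed with a dictionary attack, because anyone can recompute the same digest; an HMAC cannot be recomputed without the secret key. The key below is a placeholder:

```python
import hashlib
import hmac

email = "user@example.com"

# Plain hashing: an attacker with a dictionary of candidate emails can
# recompute every digest, so this only pseudonymizes the value.
plain = hashlib.sha256(email.encode()).hexdigest()

# Hashing with a key (HMAC): without the secret key the mapping cannot
# be rebuilt. The key must be stored securely (e.g. in KMS), not in code.
secret_key = b"example-secret-key"  # placeholder
keyed = hmac.new(secret_key, email.encode(), hashlib.sha256).hexdigest()

print(plain)
print(keyed)
```

Same input, two very different security properties - which is why the slide lists them as separate options for data in use.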
20. Blog: S3 security options in detail
https://amazon-aws-big-data-demystified.ninja/2018/06/27/aws-s3-security-introduction-and-access-management/
● Detailed encryption options in AWS
● Resource based policy vs identity based policy
21. Server Side Encryption (SSE) summary
● Server-Side Encryption with Customer-Provided Keys (SSE-C)
○ You manage the encryption keys; Amazon S3 manages the encryption as it writes to disks and the decryption when you access your objects
● Server-Side Encryption with S3-Managed Keys (SSE-S3)
● Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS)
22. Additional AWS S3 safeguards
1. VPN (site to site)
2. IP ACL
3. Identity based policy (who? me? S3 read only?)
4. Resource based policy: e.g. deny delete requests, allow only encrypted objects
5. https://amazon-aws-big-data-demystified.ninja/2018/06/27/aws-s3-security-introduction-and-access-management/
23. Basic S3 Security Diagram
Source: datacenter (secured) → Destination: AWS S3
Data in transit: client side encryption, HTTPS/SSL, VPN
Data at rest: server/client side encryption
Identity based policy: only myUser, access only to S3, write only
Resource based policy: deny unencrypted, deny delete, deny policy change, accept only from DC IP
24. Data Governance…
Data Availability Level
25. Cross Region Replication
● https://docs.aws.amazon.com/AmazonS3/latest/dev/crr.html
● Versioning must be enabled
● SSL security in transit
27. Cross Region & Cross Account Replication
https://docs.aws.amazon.com/AmazonS3/latest/dev/crr-walkthrough-2.html
Account A → Account B
28. Use case used by Walla for DR purposes
● Cross account & cross region replication
○ Copy all backups & mission critical data
○ Copy all CloudFormation templates
○ Copy all code
29. Data Governance…
Big Data Security
Architecture Level
(At rest, In motion, In use)
30. Architecture Security Considerations
● Basically, for each component consider:
○ at rest, in motion, in use
● Understand the different encryption options per component
○ Prefer SSE
○ Rotate your KMS keys periodically
● Restrict access via:
○ Identity based policy: role based security per cluster/technology
○ Resource based policy
○ Least privileges; avoid admin users as much as possible
○ Manage your EC2 keys wisely
● Understand the differences between:
○ Direct Connect vs public internet vs VPN, and their impact on your application
○ Security groups / NACL / IP based protection
○ Private subnet / public subnet for your app
● For web, consider CloudFront with geolocation protection
31. EMR Security
● EMR specific:
○ Security configurations
○ Kerberos
● All other web UIs (Zeppelin, Hue, Oozie):
○ SSL
○ User/password; prefer LDAP when possible
○ Shiro: use role based access
● EMR role → restrict access to S3 buckets
● Identity level - least privileges
● Application level (e.g. Hive: user level, table level, row level, column level, Kerberos etc.)
32. Data governance…
Privacy in a nutshell: PII
Personally Identifiable Information
33. PII Privacy challenges in a nutshell
● Problem
○ Assume the attacker has admin permissions
○ According to lawyers - there is no such thing as an anonymous DB →
■ Your anonymous DB
■ A public 3rd party personally identifiable information DB
■ Joined together… your DB is no longer anonymous
● Solution
○ Obfuscate & aggregate your data wherever possible, to avoid a scenario where an anonymous user can be reverse engineered from the data (location history, habits + timestamps etc.)
○ Keep only the RAW data you need
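The obfuscate & aggregate advice can be sketched in a few lines: replace the user id with a keyed hash and coarsen the timestamp, so location and habit patterns are harder to reverse engineer. The field names and the key are illustrative, not from the deck:

```python
import hashlib
import hmac
from datetime import datetime

# Placeholder key; in practice keep it in KMS or a secrets manager.
SECRET_KEY = b"example-pepper"

def obfuscate_event(event: dict) -> dict:
    """Obfuscate & aggregate a single raw event (hypothetical schema)."""
    ts = datetime.fromisoformat(event["timestamp"])
    return {
        # Keyed hash instead of the raw user id
        "user": hmac.new(SECRET_KEY, event["user_id"].encode(),
                         hashlib.sha256).hexdigest(),
        # Aggregate the timestamp down to the hour
        "hour": ts.strftime("%Y-%m-%d %H:00"),
        # Drop precise location; keep only a coarse field
        "region": event.get("country", "unknown"),
    }

event = {"timestamp": "2019-07-01T09:23:11", "user_id": "u-12345", "country": "IL"}
print(obfuscate_event(event))
```

The raw `user_id` and the exact timestamp never leave the inbound account; downstream consumers only ever see the coarsened record.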
34. Data governance…
Usability & Fine Grained Access Control
35. Example AWS Architecture in one account (nothing special)
Region EU-WEST-1: VPC Prod1, VPC stg
Region US-EAST-1: VPC Prod2, VPC stg
VPC peering; S3 endpoint
36. Account Segregation Motivation
● Fine grained access control
● Fine grained billing control
● Limit your blast radius -
○ What happens if your account is hacked?
○ What happens if one account out of 10 of your accounts is hacked?
37. Simple access control in one VPC
● Data engineering: EMR cluster (transformation), RW access to all buckets except modeling
● Data scientist: EMR cluster (modeling), no access to RAW data, read only access to transformation, and RW to modeling
● S3 raw buckets: data sources have write only access
● S3 transformation buckets: read only
● S3 modeling buckets: full control
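The data scientist's access in the diagram above - no raw data, read-only transformation, read/write modeling - maps directly to an identity-based policy. A sketch with placeholder bucket names:

```python
import json

# Sketch of the data scientist role from the diagram: read only on the
# transformation buckets, RW on modeling, and no statement at all for
# the raw buckets (denied by default). Bucket names are placeholders.
data_scientist_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadTransformed",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::example-transformation-bucket",
                "arn:aws:s3:::example-transformation-bucket/*",
            ],
        },
        {
            "Sid": "ReadWriteModeling",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::example-modeling-bucket",
                "arn:aws:s3:::example-modeling-bucket/*",
            ],
        },
        # No statement mentions the raw buckets: access is denied by default.
    ],
}

print(json.dumps(data_scientist_policy, indent=2))
```

Attach this to the role the modeling EMR cluster assumes, and the access-control diagram becomes enforceable rather than aspirational.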
38. Simple account segregation for business units example [common use case]
● Users account1 - login only
● Business unit 1 - account1
● Business unit 2 - account3
● DR - account4
● Users (Admin, BI team, PreSale) assume roles into the other accounts
Pre sale and BI may have access to both business units, or just to one business unit; Admin has access to everything as usual.
39. Account segregation per geolocation [data must not leave country use case]
● Users account - login only
● US-east-1 - N. Virginia
● EU-west-2 - London
● DR account
● Users (Admin, London, NewYork) assume roles into the regional accounts
40. Fine grained access control via account segregation [my use case]
● Users account - login only
● Inbound data transformation account per data source (RAW data, cleansing, encryption)
● Data modeling and obfuscation on transformed data (big data)
● Outbound data transformation account per data source (aggregated data, obfuscated data, re-encryption)
● Prod account
● DR account
● Users (DevOps, Data Science, Data Engineer) assume roles into these accounts
41. Fine Grained Data Governance Flow via Account Segregation
● Raw data1 account1, Raw data2 account2 - READ/WRITE encrypted PII data + user
● Inbound data transformation account per data source, if needed - hashed data (hashed user id, hashed data, non-PII data)
● Modeling & obfuscation account - obfuscated data, data modeling, hashed users, non-PII
● Outbound data transformation account
● Visualize / Prod account
42. Big Data Generic Architecture | one account
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
43. Big Data Generic Architecture | acc. seg.
Data ingestion & RAW Data
Data Transformation
Data Modeling & obfuscation
Data Visualization
44. Summary and Take-away Messages
● Design your network from the ground up; always keep your big data in mind
○ Data access, latency, data encryption in motion, performance impact
○ Avoid public IPs when possible. If possible, use Direct Connect / VPN
● Design your data access layers carefully, per your organization's needs
○ RAW data + encryption @rest & @motion
○ Keep DR at the data level in mind [separate accounts + separate region!]
○ Transformation + hashing level [inbound/outbound]
○ Modeling + obfuscation level
● Architecture impact
○ Security [at rest, in motion, in use] for each component
○ Identity based policy
○ Resource based policy
○ Account segregation, to limit blast radius
45. Stay in touch...
● Omid Vahdaty
● +972-54-2384178
● https://amazon-aws-big-data-demystified.ninja/
● Join our meetup, FB group and YouTube channel
○ https://www.meetup.com/AWS-Big-Data-Demystified/
○ https://www.facebook.com/groups/amazon.aws.big.data.demystified/
○ https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber