This one-day course provides hands-on skills in ingesting, analyzing, transforming, and visualizing data using AWS Athena, and in getting the best performance from it at scale.
Audience:
This class is intended for data engineers, analysts and data scientists responsible for: analyzing and visualizing big data, implementing cloud-based big data solutions, deploying or migrating big data applications to the public cloud, implementing and maintaining large-scale data storage environments, and transforming/processing big data.
Databricks is a Software-as-a-Service-style experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, developing, training, and deploying models on that data, and managing the whole workflow throughout the project. It suits those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Spark Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, a workflow scheduler, real-time workspace collaboration, and performance improvements over traditional Apache Spark.
Tomer Shiran is the founder and Chief Product Officer (CPO) of Dremio. Tomer was the fourth employee and VP of Product at MapR, a pioneer in big data analytics. He also held numerous product management and engineering positions at IBM Research and Microsoft, and founded several websites that served millions of users. He holds a Master's in computer engineering from Carnegie Mellon University and a Bachelor of Science in computer science from the Technion - Israel Institute of Technology.
The Modern Data Stack meetup is delighted to welcome Tomer Shiran. From Apache Drill and Apache Arrow to, now, Apache Iceberg, he and his teams have anchored Dremio's choices in a vision of an "open" data platform built on open source technologies. Beyond these values, which keep customers from being locked into proprietary formats, he is also mindful of the costs such platforms generate. He likewise delivers features that transform data management, through initiatives such as Nessie, which opens the road to Data as Code and multi-process transactions.
The Modern Data Stack Meetup gives Tomer Shiran "carte blanche" to share his experience and his vision of the Open Data Lakehouse.
Large Scale Lakehouse Implementation Using Structured Streaming (Databricks)
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, Auto Loader, and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion's technical team will share battle-tested tips and tricks you only get at a certain scale. Asurion's data lake executes 4,000+ streaming jobs and hosts over 4,000 tables in its production data lake on AWS.
Data Privacy with Apache Spark: Defensive and Offensive Approaches (Databricks)
In this talk, we’ll compare different data privacy techniques and protections for personally identifiable information, and their effects on statistical usefulness, re-identification risk, data schema, format preservation, and read and write performance.
We’ll cover different offensive and defensive techniques. You’ll learn what k-anonymity and quasi-identifiers are. Discover the world of suppression, perturbation, obfuscation, encryption, tokenization, and watermarking, with elementary code examples for cases where no third-party products can be used. We’ll also see what approaches can be adopted to minimize the risk of data exfiltration.
Building End-to-End Delta Pipelines on GCP (Databricks)
Delta has been powering many production pipelines at scale in the data and AI space since it was introduced a few years ago.
Built on open standards, Delta provides data reliability and enhances storage and query performance to support big data use cases (both batch and streaming), fast interactive queries for BI, and machine learning. Delta has matured over the past couple of years on both AWS and Azure and has become the de facto standard for organizations building their data and AI pipelines.
In today’s talk, we will explore building end-to-end pipelines on the Google Cloud Platform (GCP). Through presentation, code examples, and notebooks, we will build a Delta pipeline from ingest to consumption using our Delta Bronze-Silver-Gold architecture pattern, and show examples of consuming the Delta files using the BigQuery connector.
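As a rough sketch of that last step, assuming the spark-bigquery-connector is on the classpath (the table names, bucket, and paths below are hypothetical):

# Minimal sketch: expose a Delta Gold table to BigQuery via the
# spark-bigquery-connector (names and paths are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-to-bigquery").getOrCreate()

# Read the curated Gold-layer Delta table from GCS
gold = spark.read.format("delta").load("gs://my-lake/gold/orders")

# Write it to BigQuery; the connector stages data via a GCS bucket
(gold.write.format("bigquery")
     .option("table", "my_project.analytics.orders")
     .option("temporaryGcsBucket", "my-staging-bucket")
     .mode("overwrite")
     .save())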
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark, it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it can eliminate. In particular, we consider a star schema, which consists of one or more fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying the partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins, and we show significant improvements for most TPC-DS queries.
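As a rough illustration of when this optimization fires, here is a minimal PySpark sketch; the tables and partitioning scheme are hypothetical, and the flag shown is on by default in Spark 3.x:

# Minimal sketch of a star-schema join that benefits from dynamic
# partition pruning (hypothetical tables and paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-demo").getOrCreate()
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

sales = spark.read.parquet("s3://my-bucket/sales")     # fact, partitioned by date_key
dates = spark.read.parquet("s3://my-bucket/dim_date")  # small dimension table

# The filter on the dimension side is applied at runtime to prune
# the fact-table partitions the join actually reads.
result = (sales.join(dates, "date_key")
               .where(dates.year == 2024)
               .groupBy("date_key").count())
result.explain()  # look for "dynamicpruning" in the physical plan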
The Microsoft Well-Architected Framework for Data Analytics (Stephanie Locke)
With more than a decade of organizations running large data and analytics workloads in the cloud, Microsoft has extended its architecture framework to provide best practices and guidance for businesses. In this session, we'll introduce the 'Well-Architected Framework', go into detail about effective data architectures, and give you concrete next steps you can take, whether you already have a cloud data architecture or are planning your first implementation.
Should I move my database to the cloud? (James Serra)
So you have been running on-prem SQL Server for a while now. Maybe you have taken the step to move it from bare metal to a VM, and have seen some nice benefits. Ready to see a TON more benefits? If you said “YES!”, then this is the session for you, as I will go over the many benefits gained by moving your on-prem SQL Server to an Azure VM (IaaS). Then I will really blow your mind by showing you even more benefits of moving to Azure SQL Database (PaaS/DBaaS). And for those of you with a large data warehouse, I've also got you covered with Azure SQL Data Warehouse. Along the way I will talk about the many hybrid approaches so you can take a gradual approach to moving to the cloud. If you are interested in cost savings, additional features, ease of use, quick scaling, improved reliability and ending the days of upgrading hardware, this is the session for you!
Here are the slides for my talk "An intro to Azure Data Lake" at Techorama NL 2018. The session was held on Tuesday October 2nd from 15:00 - 16:00 in room 7.
Delta from a Data Engineer's Perspective (Databricks)
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end-to-end big data solutions.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Together with the Hive Metastore, these table formats are trying to solve long-standing problems in traditional data lakes with features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Databricks: A Tool That Empowers You To Do More With Data (Databricks)
In this talk we will present how Databricks has enabled the author to achieve more with data: one person building a coherent data project with data engineering, analysis, and science components, with better collaboration, better productionization methods, larger datasets, and faster turnaround.
The talk will include a demo illustrating how the multiple functionalities of Databricks help build a coherent data project: Databricks Jobs, Delta Lake, and Auto Loader for data engineering; SQL Analytics for data analysis; Spark ML and MLflow for data science; and Projects for collaboration.
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote... (Databricks)
The increasing challenge of serving ever-growing data driven by AI and analytics workloads makes disaggregated storage and compute more attractive, as it enables companies to scale their storage and compute capacity independently to match data and compute growth rates. Cloud-based big data services are gaining momentum as they provide simplified management, elasticity, and a pay-as-you-go model.
Introducing AWS Glue, a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. In this session, we will introduce key ETL features of AWS Glue and cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. Join us as we walk through the process of building scalable, efficient, and serverless ETL pipelines.
Speaker: Craig Roach, Solutions Architect, AWS
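For a flavor of what such a Glue ETL script can look like, here is a minimal PySpark-based sketch; the catalog database, table, and output path are hypothetical, and it runs inside a Glue job rather than locally:

# Minimal AWS Glue ETL script sketch (hypothetical database, table,
# and output path); intended to run as a Glue job.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table that a Glue crawler registered in the Data Catalog
events = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="raw_events")

# Write it back to S3 as Parquet for efficient querying (e.g. via Athena)
glue_context.write_dynamic_frame.from_options(
    frame=events,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/events/"},
    format="parquet")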
FLiP Into Trino
FLiP into Trino. Flink Pulsar Trino
Pulsar SQL (Trino/Presto)
Remember the days when you could wait until your batch data load was done and then run some simple queries or build stale dashboards? Those days are over; today you need instant analytics as the data streams in, in real time. You need universal analytics where that data is. I will show you how to do this using the latest cloud-native open source tools. In this talk we will use Trino, Apache Pulsar, Pulsar SQL, and Apache Flink to instantly analyze data from IoT devices, sensors, transportation systems, logs, REST endpoints, XML, images, PDFs, documents, text, semi-structured data, unstructured data, structured data, and a hundred data sources you could never dream of streaming before. I will teach you how to use Pulsar SQL to run analytics on live data.
Tim Spann
Developer Advocate
StreamNative
David Kjerrumgaard
Developer Advocate
StreamNative
https://www.starburst.io/info/trinosummit/
https://github.com/tspannhw/FLiP-Into-Trino/blob/main/README.md
https://github.com/tspannhw/StreamingAnalyticsUsingFlinkSQL/tree/main/src/main/java
select * from pulsar."public/default"."weather";
Apache Pulsar plus Trino = fast analytics at scale
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
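To make "fully compatible with Apache Spark APIs" concrete, a minimal PySpark sketch, assuming the delta-spark package is on the classpath (the path is hypothetical):

# Minimal Delta Lake sketch; the only change from plain Spark is
# format("delta"). Path and data are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.range(100).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/events")  # ACID write

# Batch read, or time travel back to an earlier version of the same table
latest = spark.read.format("delta").load("/tmp/events")
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")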
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai (Databricks)
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
This talk will break down merge in Delta Lake—what is actually happening under the hood—and then explain how you can optimize a merge. There are even some code snippets and sample configs that will be shared.
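In that spirit, here is a minimal sketch of a Delta merge from PySpark (the table path, schema, and join key are hypothetical):

# Minimal Delta Lake merge (upsert) sketch; names are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/tmp/events")  # existing Delta table
updates = spark.read.parquet("/tmp/incoming")      # new and changed rows

# Update matching rows, insert the rest, as one ACID operation.
(target.alias("t")
       .merge(updates.alias("s"), "t.event_id = s.event_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

One common optimization is adding a partition predicate to the merge condition so that only the affected files are rewritten.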
Designing and Building Next Generation Data Pipelines at Scale with Structure... (Databricks)
Lambda architectures, data warehouses, data lakes, on-premises Hadoop deployments, elastic cloud architecture… We’ve had to deal with most of these at one point or another when working with data. At Databricks, we have built data pipelines which leverage these architectures, and we work with hundreds of customers who build similar pipelines. We observed some common pain points along the way: the Hive Metastore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, and there’s no easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walk away with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
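As a rough sketch of the pattern described, using the generic Structured Streaming file source rather than Databricks-specific features (paths are hypothetical):

# Minimal Structured Streaming -> Delta pipeline sketch (hypothetical paths).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = (StructType()
          .add("event_id", StringType())
          .add("ts", TimestampType()))

# Incrementally pick up new JSON files as they land
raw = (spark.readStream
            .schema(schema)
            .json("s3://my-bucket/landing/"))

# Continuously append into a Delta table with exactly-once checkpointing
query = (raw.writeStream
            .format("delta")
            .option("checkpointLocation", "s3://my-bucket/_chk/events")
            .start("s3://my-bucket/delta/events"))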
Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. With Cloud Spanner you enjoy all the traditional benefits of a relational database: ACID transactions, relational schemas (and schema changes without downtime), SQL queries, high performance, and high availability. But unlike any other relational database service, Cloud Spanner scales horizontally, to hundreds or thousands of servers, so it can handle the highest of transactional workloads.
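For a taste of the client API, a minimal sketch with the google-cloud-spanner Python client; the instance, database, and table names are hypothetical:

# Minimal Cloud Spanner read sketch (hypothetical instance/database/table).
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

# Strongly consistent SQL read
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT SingerId, FirstName FROM Singers WHERE LastName = @name",
        params={"name": "Smith"},
        param_types={"name": spanner.param_types.STRING})
    for row in rows:
        print(row)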
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL (Amazon Web Services)
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena; it works directly with data stored in S3.
In this webinar, we will show you how easy it is to start querying your data stored in Amazon S3 with Amazon Athena. First, we will use Athena to create the schema for data already in S3. Then we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena, and talk about supported queries, data formats, and strategies to save costs when querying data with Athena. (A programmatic sketch of the same flow follows the learning objectives below.)
Learning Objectives:
• Learn about the capabilities and features of Amazon Athena
• Understand the different use cases
• Describe how to run queries and options to store and visualize results
• Understand integration with other AWS big data services such as Amazon QuickSight
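The same create-then-query flow can also be driven programmatically; a minimal boto3 sketch, with hypothetical bucket and table names:

# Minimal Athena-from-Python sketch using boto3 (hypothetical names).
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run(sql):
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "mydb"},
        ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
    )["QueryExecutionId"]
    while True:  # poll until the query finishes
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(1)

# Schema-on-read: define a table over existing S3 data, then query it.
run("CREATE DATABASE IF NOT EXISTS mydb")
run("""CREATE EXTERNAL TABLE IF NOT EXISTS mydb.logs (ip string, ts string)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       LOCATION 's3://my-data-bucket/logs/'""")
qid, state = run("SELECT ip, count(*) FROM mydb.logs GROUP BY ip LIMIT 10")
print(state, athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"][:3])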
Vadim Solovey is the CTO of DoiT International and has helped implement Google BigQuery as a cloud data warehouse for many medium and large sized data and analytics initiatives. BigQuery's serverless architecture has redefined what it means to be fully managed for hundreds of Israeli startups.
Recently, Google announced an update to BigQuery that dramatically advances cloud data analytics for large-scale businesses: BigQuery now supports Standard SQL, implementing the SQL 2011 standard, as well as new ODBC drivers that make it possible to use BigQuery with a number of tools ranging from Microsoft Excel to traditional business intelligence systems such as MicroStrategy and Qlik.
Agenda:
• Partitioned tables
• The ability to update and delete rows and columns using SQL (see the sketch after this agenda)
• Integration with IAM for fine-grained security policies
• Monitoring w/ StackDriver to track performance and usage
• Query sharing via links, to foster knowledge within orgs
• Cost optimisation strategies
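To illustrate the Standard SQL and DML agenda items, a minimal sketch with the google-cloud-bigquery Python client (project, dataset, and table are hypothetical):

# Minimal BigQuery Standard SQL / DML sketch (hypothetical dataset/table).
from google.cloud import bigquery

client = bigquery.Client()  # uses default project credentials

# Standard SQL DML: update rows in place (one of the announced features)
client.query(
    "UPDATE `my_project.analytics.users` "
    "SET status = 'inactive' WHERE last_seen < '2016-01-01'"
).result()  # wait for the job to finish

# Ordinary Standard SQL query
for row in client.query(
        "SELECT status, COUNT(*) AS n "
        "FROM `my_project.analytics.users` GROUP BY status"):
    print(row.status, row.n)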
AWS Athena vs. Google BigQuery for Interactive SQL Queries (DoiT International)
During re:Invent 2016, AWS released Amazon Athena - an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
We took a look at AWS Athena and compared it to Google BigQuery - another player in serverless interactive data analysis.
Would you like to know which one is the right tool for you? Join us for this meetup to learn about AWS Athena and for a test drive of querying exactly the same dataset using AWS Athena and Google BigQuery, to see where each one shines (or totally blows it).
Webinar: Fighting Fraud with Graph Databases (DataStax)
Modern fraud detection poses significant engineering challenges, from managing ingestion at scale to analyzing patterns in real time. We'll first take a look at how DataStax Enterprise Graph, powered by the industry’s best version of Apache Cassandra™, can meet those requirements and help you save the day.
We believe that security *IS* a shared responsibility: when we give developers the power to create infrastructure, security becomes their responsibility, too.
During this meetup, we'd like to share our experience implementing security best practices that development teams can adopt directly to build more robust and secure cloud environments. Make cloud security your team's sport!
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL (Amazon Web Services)
Amazon Athena is a new interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena; it works directly with data stored in S3.
In this session, we will show you how easy it is to start querying your data stored in Amazon S3 with Amazon Athena. First we will use Athena to create the schema for data already in S3. Then, we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena. Then, we will talk about supported queries, data formats, and strategies to save costs when querying data with Athena.
"In this session, you will learn how to easily access your data on S3, and how to visualize and generate insights from Amazon Athena and other data sources through Amazon QuickSight. In addition we will share some tips & best practices for using Athena & QuickSight.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from various data sources (Amazon Redshift, Amazon Athena, Amazon EMR, Amazon RDS and more)."
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL. (Amazon Web Services)
Amazon Athena is a new interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena; it works directly with data stored in S3.
In this session, we will show you how easy it is to start querying your data stored in Amazon S3 with Amazon Athena. First we will use Athena to create the schema for data already in S3. Then, we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena. Then, we will talk about supported queries, data formats, and strategies to save costs when querying data with Athena.
by Joyjeet Banerjee, Solutions Architect, AWS
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena; it works directly with data stored in S3. Level 200.
In this session, we will show you how easy it is to start querying your data stored in Amazon S3, with Amazon Athena. First we will use Athena to create the schema for data already in S3. Then, we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena. Then, we will talk about supported queries, data formats, and strategies to save costs when querying data with Athena.
"Conceptually, a data lake is a flat data store to collect data in its original form, without the need to enforce a predefined schema. Instead, new schemas or views are created “on demand”, providing a far more agile and flexible architecture while enabling new types of analytical insights. AWS provides many of the building blocks required to help organizations implement a data lake. In this session, we will introduce key concepts for a data lake and present aspects related to its implementation. We will discuss critical success factors, pitfalls to avoid as well as operational aspects such as security, governance, search, indexing and metadata management. We will also provide insight on how AWS enables a data lake architecture.
A data lake is a flat data store to collect data in its original form, without the need to enforce a predefined schema. Instead, new schemas or views are created ""on demand"", providing a far more agile and flexible architecture while enabling new types of analytical insights. AWS provides many of the building blocks required to help organizations implement a data lake. In this session, we introduce key concepts for a data lake and present aspects related to its implementation. We discuss critical success factors and pitfalls to avoid, as well as operational aspects such as security, governance, search, indexing, and metadata management. We also provide insight on how AWS enables a data lake architecture. Attendees get practical tips and recommendations to get started with their data lake implementations on AWS."
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic... (Amazon Web Services)
Learning Objectives:
- Understand how to build a serverless big data solution quickly and easily
- Learn how to discover and prepare all your data for analytics
- Learn how to query and visualize analytics on all your data to create actionable insights
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena; it works directly with data stored in S3.
Data processing and analysis is where big data is most often consumed - driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing and interactive analytics. AWS services to be covered include Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Organizations often need to quickly analyze large amounts of data, such as logs generated from a wide variety of sources and formats. However, traditional approaches require a lot of time and effort designing complex data transformation and loading processes; and configuring data warehouses. Using AWS, you can start querying your datasets within minutes. In this session you will learn how you can deploy a managed Presto environment in minutes to interactively query log data using standard ANSI SQL. Presto is a popular open source SQL engine for running interactive analytic queries against data sources of all sizes. We will talk about common use cases and best practices for running Presto on Amazon EMR.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards, including Spark and Presto on Amazon EMR, Amazon Redshift Spectrum, and Amazon Athena, to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
Data processing and analysis is where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics with AWS services, such as, Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Created by: Jason Morris, Solutions Architect
Map Services on Amazon AWS, Microsoft Azure and Google Cloud Platform (문기 박)
Public cloud service map, researched by Bahk, Moon-Kee
Sources:
https://cloud.google.com/free/docs/map-aws-google-cloud-platform
https://cloud.google.com/free/docs/map-azure-google-cloud-platform
https://docs.microsoft.com/ko-kr/azure/architecture/aws-professional/services
Training Generative Adversarial Networks is a delicate process. Recent advances in GAN research have produced incredible generated images. However, training GANs remains tricky and slow.
With TensorFlow and NVIDIA's new Volta architecture, it is possible to reduce training time by up to 50% using simplified code.
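The Volta speedup comes largely from FP16 Tensor Cores; here is a minimal sketch using today's Keras mixed-precision API, a later and simpler interface than the code of the talk's era (the model is a toy example):

# Minimal mixed-precision training sketch (modern Keras API; toy model).
import tensorflow as tf

# Compute in float16 on Tensor Cores, keep variables in float32
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
    # Final layer in float32 for a numerically stable softmax
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = tf.random.normal((1024, 784))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=256, epochs=1)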
1. Double Orchestration of Redis Enterprise cluster on Kubernetes
**************
In this session we'll show how we deploy a highly available database on Kubernetes, the considerations we took into account when deploying a stateful application, and the challenges of meeting different clients' demands for different k8s environments.
2. Operators to the rescue: stateful applications made easy with operators
**************
Kubernetes 1.7 introduced an important feature called custom controllers. This allows you to customize your Kubernetes installation and add your own resources to be managed in the native Kubernetes manner.
The session will present the operator concept and cover our journey developing the Redis Labs operator - why we chose it and how we use it.
Learn from dozens of large-scale deployments how to get the most out of your Kubernetes environment:
- Container images optimization
- Organizing namespaces
- Readiness and Liveness probes
- Resource requests and limits
- Failing with grace
- Mapping external services
- Upgrading clusters with zero downtime
An Open-Source Platform to Connect, Manage, and Secure Microservices (DoiT International)
Services are at the core of modern software architecture. Deploying a series of modular, small (micro-)services rather than big monoliths gives developers the flexibility to work in different languages, technologies and release cadence across the system; resulting in higher productivity and velocity, especially for larger teams.
With the adoption of microservices, however, new problems emerge due to the sheer number of services that exist in a larger system. Problems that had to be solved once for a monolith, like security, load balancing, monitoring, and rate limiting need to be handled for each service.
Istio, announced at GlueCon 2017, addresses these problems in a fundamental way through a service mesh framework. With Istio, developers can implement the core logic for the microservices, and let the framework take care of the rest – traffic management, discovery, service identity and security, and policy enforcement. Better yet, this can be also done for existing microservices without rewriting or recompiling any of their parts. Istio uses Envoy as its runtime proxy component and provides an extensible intermediation layer which allows global cross-cutting policy enforcement and telemetry collection.
So you are deployed to production (or soon to be) with Elasticsearch running and powering important application features, or maybe using it for centralized logging for effective debugging.
Was your Elastic cluster deployed correctly? Is it stable? Can it hold the throughput you expect it to?
How did you do capacity planning? How do you tell if the cluster is healthy, and what should you monitor? How do you apply effective multi-tenancy? And what would be an ideal cluster topology and data ingestion architecture?
We already trust artificial intelligence to drive our cars, but we still configure thresholds and sift through logs manually. In this talk, Ronny Lehmann, Loom CTO, will discuss how he spent months analyzing modern ops work until he was finally able to extract the common baseline practices, and how we used this understanding to build a machine that complements ops teams, automating the work that is more suitable for machines and leaving for "humans" just the parts that require humans. What we built saves you time spent on parsers, on configuring and tuning rules and alerts, on conducting root-cause analysis and triage, and finally on figuring out what to do.
In this meeting we'll host a discussion on Google Cloud Platform and Amazon Web Services to shed light on similarities and differences between the platforms. If you have questions about how the platforms compare, this is the meeting to attend!
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing (DoiT International)
Dataflow is a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation. Cloud Dataflow frees you from operational tasks like resource management and performance optimization.
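Dataflow's programming model is what became Apache Beam; a minimal Python sketch of the unified model, runnable locally or on the Dataflow runner (paths are hypothetical):

# Minimal Apache Beam sketch of the Dataflow model (hypothetical paths).
# The same pipeline runs locally (DirectRunner) or on Cloud Dataflow.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(runner="DirectRunner")  # or runner="DataflowRunner"

with beam.Pipeline(options=opts) as p:
    (p
     | "Read"  >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
     | "Words" >> beam.FlatMap(lambda line: line.split())
     | "Pair"  >> beam.Map(lambda w: (w, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Fmt"   >> beam.MapTuple(lambda w, n: f"{w},{n}")
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts"))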
Kubernetes has been a key component for many companies to reduce technical debt in infrastructure by:
• Fostering the Adoption of Docker
• Simplifying Container Management
• Onboarding Developers On Infrastructure
• Unlocking Continuous Integration and Delivery
During this meetup we are going to discuss the following topics and share some best practices
• What's new with Kubernetes 1.3
• Generate Cluster Configuration using CloudFormation
• Deploy Kubernetes Clusters on AWS
• Scaling the Cluster
• Integrating Ingress with Elastic Load Balancer
• Using Internal ELB's as Kubernetes' Service
• Using EBS for persistent volumes
• Integrating Route53
Dataflow - A Unified Model for Batch and Streaming Data Processing (DoiT International)
Presented at the "Batch and Streaming Data Processing and Visualize 300TB in 5 Seconds" meetup on April 18th, 2016 (http://www.meetup.com/Big-things-are-happening-here/events/229532500)
Kubernetes is making good on the promise of changing the datacenter from being a group of computers to "a computer" itself. This presentation outlines the new features in the K8s 1.1 and 1.2 releases.
Multi-cluster Kubernetes Networking - Patterns, Projects and Guidelines (Sanjeev Rampal)
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with focus on 4 key topics.
1) Key patterns for Multi-cluster architectures
2) Architectural comparison of several OSS/ CNCF projects to address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/ deploying these solutions.
9. [1] Challenges
Organizations are challenged to analyze data without heavy investments and long deployment times
● Significant amount of effort required to analyze data on S3
● Users often have access to only aggregated data sets
● Managing Hadoop or data warehouse requires expertise
10. [1] Introducing AWS Athena
Athena is an interactive query service that makes it easy to analyze data directly from AWS S3 using Standard SQL
11. [1] AWS Athena Overview
Easy to use
1. Login to a console
2. Create a table (either by following a wizard or by typing Hive DDL statement)
3. Start querying
12. [1] AWS Athena is Highly Available
High Availability Features
● You connect to a service endpoint or log into a console
● Athena uses warm compute pools across multiple availability zones
● Your data is in Amazon S3 which has 99.999999999% durability
13. [1] Querying Data Directly from Amazon S3
Direct access to your data without hassles
● No loading of data
● No ETL required
● No additional storage required
● Query of data in raw format
14. [1] Use ANSI SQL
Use of skills you probably already have
● Start with writing Standard ANSI SQL syntax
● Support for complex joins, nested queries & window functions
● Support for complex data types (arrays, structs)
● Support for partitioning of data by any key:
○ e.g. date, time, custom keys
○ Or customer-year-month-day-hour
15. [1] AWS Athena Overview
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Features:
● Serverless with zero spin-up time and transparent upgrades
● Data can be stored in CSV, JSON, ORC, Parquet and even Apache web logs format
○ AVRO (coming soon)
● Compression is supported out of the box
● Queries cost $5 per terabyte of data scanned with a 10 MB minimum per query
Additional Information:
● Not a general purpose database
● Usually used by Data Analysts to run interactive queries over large datasets
● Currently available at us-east-1 (North Virginia) or the us-west-2 (Oregon)
16. [1] Underlying Technologies
Presto (originating from Facebook)
● Used for SQL queries
● In-memory distributed querying engine, ANSI SQL compatible with extensions
Hive (originating from Hadoop project)
● Used for DDL functionality
● Complex data types
● Multitude of formats
● Supports data partitioning
20. [2] Interacting with AWS Athena
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Web User Interface:
● Run queries and examine results
● Manage databases and tables
● Save queries and share across organization for re-use
● Query History
JDBC Driver:
● Programmatic way to access AWS Athena (see the sketch below)
○ SQL Workbench, JetBrains DataGrip, sqlline
○ Your own app
AWS QuickSight:
● Visualize Athena data with charts, pivots and dashboards.
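Beyond the console and the JDBC driver, community Python DB-API clients exist as well; a minimal sketch with PyAthena, which is not mentioned on the slide (bucket and staging locations are hypothetical), running a window-function query of the kind slide 14 advertises:

# Minimal programmatic-access sketch using PyAthena, a community
# DB-API client (not from the slides; names are hypothetical).
from pyathena import connect

conn = connect(
    s3_staging_dir="s3://my-results-bucket/athena/",  # query results location
    region_name="us-east-1")

cur = conn.cursor()
# Window function over the sample table, per the ANSI SQL support slide
cur.execute("""
    SELECT vendor_id, pickup_datetime,
           row_number() OVER (PARTITION BY vendor_id
                              ORDER BY pickup_datetime) AS rn
    FROM mydb.yellow_trips
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)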
23. [3] Data and Compression Formats
The data formats presently supported are
● CSV
● TSV
● Parquet (Snappy is default compression)
● ORC (Zlib is default compression)
● JSON
● Apache Web Server logs (RegexSerDe)
● Custom Delimiters
Compression Formats
● Currently, Snappy, Zlib, and GZIP are the supported compression formats.
● LZO is not supported as of today
24. [3] CSV Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://nyc-yellow-trips/csv/'
27. [3] RegEx Serde (Apache Log Example)
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
Date DATE, Time STRING, Location STRING,
Bytes INT, RequestIP STRING, Method STRING,
Host STRING, Uri STRING, Status INT, Referrer STRING,
os STRING, Browser STRING, BrowserVersion STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^
]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^
]+)s+[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$")
LOCATION 's3://athena-examples/cloudfront/plaintext/';
28. [3] Comparing Formats
PARQUET
● Columnar format
● Schema segregation into footer
● Column major format
● All data is pushed to the leaf
● Integrated compression and indexes
● Support for predicate pushdown
ORC
● Apache Top Level Project
● Schema segregation into footer
● Column major format with stripes
● Integrated compression, indexes, and stats
● Support for predicate pushdown
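To make predicate pushdown concrete, a hedged sketch (table and column names are hypothetical): with Parquet or ORC, a filter like the one below lets the engine skip whole row groups or stripes whose embedded min/max statistics rule out a match.
-- Hypothetical table; only matching row groups/stripes are read
SELECT request_count
FROM daily_stats
WHERE event_date >= '2017-01-01';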
30. [3] Converting to Parquet or ORC format
● You can use Hive CTAS to convert data:
CREATE TABLE new_key_value_store
STORED AS PARQUET
AS SELECT c1, c2, c3, .., cN FROM noncolumnar_table
SORT BY key;
● You can also use Spark to convert the files to Parquet or ORC (see the sketch after this list)
● 20 lines of PySpark code running on EMR [1]
○ Converts 1TB of text data into 130GB of Parquet with Snappy compression
○ Approx. cost is $5
[1] https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
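For the Spark route mentioned above, a rough Spark SQL equivalent of the conversion (table and path names are hypothetical, and exact syntax varies by Spark version):
-- Rewrite an existing text-based table as Parquet on S3
CREATE TABLE parquet_logs
USING PARQUET
LOCATION 's3://my-example-bucket/parquet/'
AS SELECT * FROM text_logs;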
31. [3] Pay By the Query ($5 per TB scanned)
● You pay for the amount of data scanned
● Ways to save on cost:
○ Compress
○ Convert to columnar format
○ Use partitioning
● Free: DDL queries, failed queries
Dataset | Size on S3 | Query Runtime | Data Scanned | Cost
Logs stored as CSV | 1TB | 237s | 1.15TB | $5.75
Logs stored as PARQUET | 130GB | 5.13s | 2.69GB | $0.013
Savings | 87% less | 34x faster | 99% less | 99.7% cheaper
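In query terms, those savings show up as scanning less data; a hedged illustration with hypothetical names:
-- Scans every column of every file in the table
SELECT * FROM logs;
-- Scans one column of one partition only
SELECT status FROM logs WHERE dt = '2009-04-12';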
34. [4] Partitioning Data
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving
performance and reducing cost
Benefits of Data Partitioning:
● Partitions limit the scope of data being scanned during the query
● Improves performance
● Reduces query cost
● You can partition your data by any key
Common Practice:
● Based on time, often leading with a multi-level partitioning scheme
○ YEAR -> MONTH -> DAY -> HOUR
35. [4] Data already partitioned and stored on S3
$ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
CREATE EXTERNAL TABLE impressions (
... ...)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;
-- Load partitions into Athena
MSCK REPAIR TABLE impressions;
-- Run a sample query
SELECT dt, impressionid FROM impressions WHERE dt < '2009-04-12-14-00' AND dt >= '2009-04-12-13-00';
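Note that MSCK REPAIR TABLE only discovers partitions that follow the key=value naming shown above; for other layouts you can add partitions explicitly (the partition value and path below are illustrative):
ALTER TABLE impressions ADD PARTITION (dt='2009-04-12-15-00')
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-12-15-00/';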
38. [5] Converting to Columnar Formats (batch data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Hive installed.
● In the Steps section of the cluster create statement, specify a script stored in Amazon S3 that
points to your input data and creates output data in a columnar format in an Amazon S3
location. In this example, the cluster auto-terminates.
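A condensed sketch of what such a Hive script might contain (table names, schema, and buckets are hypothetical):
-- Source data as it sits on S3
CREATE EXTERNAL TABLE text_logs (ts string, ip string, status int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/raw/';
-- Target table stored as Parquet in another S3 location
CREATE EXTERNAL TABLE parquet_logs (ts string, ip string, status int)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/parquet/';
-- Rewrite the data in columnar form
INSERT OVERWRITE TABLE parquet_logs SELECT * FROM text_logs;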
39. [5] Converting to Columnar Formats (streaming data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Spark installed
● Run a Spark Streaming job that reads data from a Kinesis stream and writes Parquet files to S3
41. [6] Athena Security
Amazon offers three ways to control data access:
● AWS Identity and Access Management policies
● Access Control Lists
● Amazon S3 bucket policies
Users control who can access data on S3. It's possible to fine-tune security so that different
people see different sets of data, and to grant access to other users' data.
43. [7] Service Limits
You can request a limit increase by contacting AWS Support.
● Currently, you can submit one query at a time and run up to 5 (five) concurrent
queries per account.
● Query timeout: 30 minutes
● Number of databases: 100
● Tables: 100 per database
● Number of partitions: 20k per table
● You may encounter a limit for Amazon S3 buckets per account, which is 100.
44. [7] Known Limitations
The following are known limitations in Amazon Athena
● User-defined functions (UDFs or UDAFs) are not supported.
● Stored procedures are not supported.
● Currently, Athena does not support the transactional features found in Hive or Presto. For a full
list of unsupported keywords, see Unsupported DDL.
● LZO is not supported. Use Snappy instead.
45. [7] Avoid Surprises
Use backticks if table or column names begin with an underscore. For example:
CREATE TABLE myUnderScoreTable (
`_id` string,
`_index` string,
...
For the LOCATION clause, use a trailing slash:
USE
s3://path_to_bucket/
DO NOT USE
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
47. Google BigQuery
• Serverless Analytical Columnar Database based on Google Dremel
• Data:
• Native Tables
• External Tables (*SV, JSON, AVRO files stored in Google Cloud Storage bucket)
• Ingestion:
• File Imports
• Streaming API (up to 100K records/sec per table)
• Federated Tables (files in bucket, Bigtable table or Google Spreadsheet)
• ANSI SQL 2011
• Priced at $5/TB of scanned data + storage + streaming (if used)
• Cost optimization: partitioning, limiting queried columns, the 24-hour cache, and cold-data pricing
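As an example of the by-DAY partitioning and cost controls, day-partitioned BigQuery tables are filtered through the _PARTITIONTIME pseudo-column (dataset and table names are hypothetical):
SELECT event_id
FROM my_dataset.events
WHERE _PARTITIONTIME = TIMESTAMP('2017-04-01');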
48. Summary
Feature | AWS Athena | Google BigQuery
Data Formats | *SV, JSON, PARQUET/z, ORC/z | External (*SV, JSON, AVRO) / Native
ANSI SQL Support | Yes* | Yes*
DDL Support | Only CREATE/ALTER/DROP | CREATE/UPDATE/DELETE (w/ quotas)
Underlying Technology | FB Presto | Google Dremel
Caching | No | Yes
Cold Data Pricing | S3 Lifecycle Policy | 50% discount after 90 days of inactivity
User Defined Functions | No | Yes
Data Partitioning | On Any Key | By DAY
Pricing | $5/TB (scanned) plus S3 ops | $5/TB (scanned) less cached data
49. Test Drive Summary
Query Type | AWS Athena (GB/time) | Google BigQuery (GB/time) | t.diff %
[1] LOOKUP | 48MB (4.1s) | 130GB (2.0s) | -51%
[2] LOOKUP & AGGR | 331MB (4.35s) | 13.4GB (2.7s) | -48%
[3] GROUP/ORDER BY | 5.74GB (8.85s) | 8.26GB (5.4s) | -27%
[4] TEXT FUNCTIONS | 606MB (11.3s) | 13.6GB (2.4s) | -470%
[5] JSON FUNCTIONS | 29MB (17.8s) | 63.9GB (8.9s) | -100%
[6] REGEX FUNCTIONS | (1.3s) | 5.45GB (1.9s) | +31%
[7] FEDERATED DATA | 133GB (19.4s) | 133GB (36.4s) | +47%
50. What does Athena do better than BigQuery?
Advantages:
• Can be faster than BigQuery, especially with federated/external tables
• Ability to use regex to define a schema (query files without needing to change the format)
• Can be faster and cheaper than BigQuery when using a partitioned/columnar format
• Tables can be partitioned on any column
Issues:
• It’s not easy to convert data between formats
• Doesn’t support DML, i.e. no INSERT/UPDATE/DELETE
• No built-in ingestion
51. What does BigQuery do better than Athena?
• It has native table support, giving it better performance and more features
• It’s easy to manipulate data, insert/update records and write query results back to a table
• Querying native tables is very fast
• Easy to convert non-columnar formats into a native table for columnar queries
• Supports UDFs, which Athena is expected to add in the future
• Supports nested tables (nested and repeated fields)