What is AWS Glue

  1. What is AWS Glue? AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. It is provided by Amazon as part of Amazon Web Services and was introduced in August 2017. Glue automatically provisions and manages the computing resources its jobs require, and jobs can be started on demand, on a schedule, or in response to events. A primary use of Glue is to scan data stores reachable from the same Virtual Private Cloud (or an equivalent accessible network element, even if not provided by AWS), particularly Amazon S3. Jobs are billed according to compute time, with a minimum of 1 minute. Glue discovers the source data and stores the associated metadata (e.g. a table's schema: field names, types, and lengths) in the AWS Glue Data Catalog, which is then accessible via the AWS console or APIs.
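
To make the billing rule on this slide concrete, here is a minimal sketch of the arithmetic. It assumes compute is metered in data processing units (DPUs) and charged per DPU-hour; the rate of 0.44 used in the usage note is a placeholder, not a quoted price, and the function names are hypothetical.

```python
def billed_seconds(run_seconds: float, minimum_seconds: int = 60) -> float:
    """Apply the 1-minute minimum: shorter runs are billed as one minute."""
    return max(run_seconds, minimum_seconds)


def job_cost(run_seconds: float, dpus: int, price_per_dpu_hour: float) -> float:
    """Cost = DPUs x billed hours x price per DPU-hour (price is an assumption)."""
    billed_hours = billed_seconds(run_seconds) / 3600
    return dpus * billed_hours * price_per_dpu_hour
```

For example, `job_cost(30, dpus=10, price_per_dpu_hour=0.44)` charges the 30-second run as a full minute, while a 10-minute run is charged for exactly 10 minutes.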
  2. What is ETL? ETL stands for extract, transform, and load. Taking usage logs stored in S3 as an example:
     • Extract: The script reads all the usage data from the S3 bucket into a single data frame (comparable to a data frame in Pandas).
     • Transform: Suppose the original data contains 10 log entries per second on average, and the analytics team wants the data aggregated per minute with a specific logic.
     • Load: The script writes the processed data back to another S3 bucket for the analytics team.
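
The three steps can be sketched in plain Python. This is an illustration only, not how a Glue job is written: it assumes a hypothetical log format of `timestamp,user,bytes`, uses summing as the "specific logic", and works on in-memory strings instead of S3 objects.

```python
from collections import defaultdict
from datetime import datetime


def extract(lines):
    """Extract: parse 'ISO-timestamp,user,bytes' lines into records."""
    records = []
    for line in lines:
        ts, user, size = line.strip().split(",")
        records.append({"ts": datetime.fromisoformat(ts), "user": user, "bytes": int(size)})
    return records


def transform(records):
    """Transform: sum bytes per (minute, user) by truncating each timestamp."""
    totals = defaultdict(int)
    for record in records:
        minute = record["ts"].replace(second=0, microsecond=0)
        totals[(minute, record["user"])] += record["bytes"]
    return [
        {"minute": minute.isoformat(), "user": user, "bytes": total}
        for (minute, user), total in sorted(totals.items())
    ]


def load(rows):
    """Load: render the aggregated rows as CSV text, ready to be written out."""
    lines = ["minute,user,bytes"]
    lines += [f'{row["minute"]},{row["user"]},{row["bytes"]}' for row in rows]
    return "\n".join(lines)
```

Calling `load(transform(extract(raw_lines)))` runs the whole pipeline; in a real job the extract and load steps would read from and write to S3.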
  3. How does AWS Glue work? AWS Glue can run your extract, transform, and load (ETL) jobs as new data arrives. For example, you can configure AWS Glue to start your ETL jobs as soon as new data becomes available in Amazon Simple Storage Service (Amazon S3).
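
One common way to wire this up is an AWS Lambda function that receives the S3 event notification and starts a Glue job run. The sketch below is one possible pattern, not the only one. The job name and the `--source_path` job argument are hypothetical, and the Glue client is passed in so the function can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

JOB_NAME = "usage-aggregation-job"  # hypothetical Glue job name


def start_jobs_for_new_objects(event, glue_client, job_name=JOB_NAME):
    """Start one Glue job run per object listed in an S3 event notification."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            # "--source_path" is an assumed argument the job script would read.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Inside a Lambda handler, `glue_client` would typically be `boto3.client("glue")`.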
  4. How does AWS Glue work? Choose the data integration engine in AWS Glue that best suits your users and workloads, for example Apache Spark for large distributed jobs or a Python shell for smaller scripts.
  5. How does AWS Glue work? You can use the Data Catalog to quickly discover and search multiple AWS datasets without moving the data. Once the data is cataloged, it is immediately available for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
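
As a rough illustration of reading cataloged metadata through the API, the sketch below pulls a table's storage location and column list out of a `get_table` response. The function name is hypothetical, and the client is passed in as a parameter so the example does not need AWS credentials.

```python
def describe_table(glue_client, database, table):
    """Return the storage location and (name, type) column pairs of a catalog table."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    descriptor = response["Table"]["StorageDescriptor"]
    return {
        "location": descriptor.get("Location"),
        "columns": [(col["Name"], col.get("Type")) for col in descriptor["Columns"]],
    }
```

With boto3 this would be called as `describe_table(boto3.client("glue"), "my_database", "my_table")`, where both names are placeholders.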
  6. How does AWS Glue work? AWS Glue Studio makes it easier to visually create, run, and monitor AWS Glue ETL jobs. You can build ETL jobs that move and transform data using a drag-and-drop editor, and AWS Glue automatically generates the code.
  7. Use cases where AWS Glue fits:
     • Simplify ETL pipeline development: Remove infrastructure management with automatic provisioning and worker management, and consolidate all your data integration needs into a single service.
     • Discover data efficiently: Quickly identify data across multiple AWS datasets, and then make it instantly available for querying and transforming.
     • Interactively explore, experiment on, and process data: Using AWS Glue interactive sessions, data engineers can interactively explore and prepare data using the integrated development environment (IDE) or notebook of their choice.
     • Support various processing frameworks and workloads: More easily support various data processing frameworks, such as ETL and ELT, and various workloads, including batch, micro-batch, and streaming.
  8. What is AWS Glue Studio? AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor data integration jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on the Apache Spark-based serverless ETL engine in AWS Glue. With AWS Glue Studio, you can create and manage jobs that gather, transform, and clean data. You can also use it to troubleshoot and edit job scripts. AWS Glue features fall into three major categories:
     • Discover and organize data
     • Transform, prepare, and clean data for analysis
     • Build and monitor data pipelines
  9. How to access AWS Glue? You can create, view, and manage your AWS Glue jobs using the following interfaces:
     • AWS Glue console: Provides a web interface for you to create, view, and manage your AWS Glue jobs.
     • AWS Glue Studio: Provides a graphical interface for you to create and edit your AWS Glue jobs visually.
     • AWS Glue section of the AWS CLI Reference: Provides AWS CLI commands that you can use with AWS Glue.
     • AWS Glue API: Provides a complete API reference for developers.
  10. AWS Glue console: You use the AWS Glue console to define and orchestrate your ETL workflow. The console calls several API operations in the AWS Glue Data Catalog and AWS Glue Jobs system to perform the following tasks:
     • Define AWS Glue objects such as jobs, tables, crawlers, and connections.
     • Schedule when crawlers run.
     • Define events or schedules for job triggers.
     • Search and filter lists of AWS Glue objects.
     • Edit transformation scripts.
  11. AWS Glue Studio: AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue's Apache Spark-based serverless ETL engine.
  12. Streaming ETL in AWS Glue: AWS Glue enables you to perform ETL operations on streaming data using continuously running jobs. AWS Glue streaming ETL is built on the Apache Spark Structured Streaming engine, and can ingest streams from Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). Streaming ETL can clean and transform streaming data and load it into Amazon S3 or JDBC data stores. Use streaming ETL in AWS Glue to process event data such as IoT streams, clickstreams, and network logs.
  13. The AWS Glue Jobs system: The AWS Glue Jobs system provides managed infrastructure to orchestrate your ETL workflow. You can create jobs in AWS Glue that automate the scripts you use to extract, transform, and transfer data to different locations. Jobs can be scheduled and chained, or they can be triggered by events such as the arrival of new data.
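
Scheduling and chaining are both expressed as triggers. The sketch below builds the keyword arguments for the Glue `create_trigger` API call for each case; the helper names, trigger names, and job names are hypothetical, and the helpers only build dictionaries so they can be inspected without calling AWS.

```python
def scheduled_trigger(name, job_name, cron_fields):
    """Arguments for a trigger that starts job_name on a cron schedule.

    cron_fields uses six fields: minutes, hours, day-of-month, month,
    day-of-week, year (for example "0 12 * * ? *" for daily at 12:00 UTC).
    """
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_fields})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def chained_trigger(name, upstream_job, downstream_job):
    """Arguments for a trigger that starts downstream_job after upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": upstream_job, "State": "SUCCEEDED"}
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }
```

With boto3, either dictionary could be passed as `boto3.client("glue").create_trigger(**scheduled_trigger("daily-noon", "usage-aggregation-job", "0 12 * * ? *"))`.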
  14. Serverless ETL jobs run in isolation: AWS Glue runs your ETL jobs in an Apache Spark serverless environment. AWS Glue runs these jobs on virtual resources that it provisions and manages in its own service account. AWS Glue is designed to do the following:
     • Segregate customer data.
     • Protect customer data in transit and at rest.
     • Access customer data only as needed in response to customer requests, using temporary, scoped-down credentials, or with a customer's consent to IAM roles in their account.
  15. Data sources and destinations: AWS Glue allows you to read and write data from multiple systems and databases, including:
     • Amazon S3
     • Amazon DynamoDB
     • Amazon Redshift
     • Amazon Relational Database Service (Amazon RDS)
     • Third-party JDBC-accessible databases
     • MongoDB and Amazon DocumentDB (with MongoDB compatibility)
     • Other marketplace connectors and Apache Spark plugins
     Data streams: AWS Glue can stream data from the following systems:
     • Amazon Kinesis Data Streams
     • Apache Kafka
     AWS Glue is available in several AWS Regions.
  16. Components of AWS Glue:
     • Data Catalog: Holds the metadata and the structure of the data.
     • Database: A logical grouping in the Data Catalog that holds the table definitions for the sources and targets.
     • Table: One or more tables are created in the database to describe the data used by the source and target.
     • Crawler and classifier: A crawler scans the data at the source and uses built-in or custom classifiers to determine its format and schema. It creates new metadata tables in the Data Catalog or updates existing ones.
     • Job: The business logic that carries out an ETL task. This logic is written as an Apache Spark script in Python or Scala.
     • Trigger: Starts the ETL job execution on demand or at a specific time.
     • Development endpoint: Provides a development environment where the ETL job script can be developed, tested, and debugged.
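
To show how the crawler, database, and Data Catalog fit together, here is a minimal sketch that defines a crawler over an S3 path and starts it. The function name and all argument values are hypothetical; the IAM role and the catalog database are assumed to exist already, and the client is passed in so the example can run without AWS access.

```python
def create_and_start_crawler(glue_client, name, role_arn, database, s3_path):
    """Define a crawler that catalogs s3_path into the given database, then run it."""
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,          # IAM role the crawler assumes to read the data
        DatabaseName=database,  # Data Catalog database that receives the tables
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # The crawl runs asynchronously; tables appear in the catalog once it finishes.
    glue_client.start_crawler(Name=name)
```

With boto3, `glue_client` would be `boto3.client("glue")`.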
  17. Summary: AWS Glue makes it easy to integrate data across your architecture. It integrates with AWS analytics services and Amazon S3 data lakes. Its integration interfaces and job-authoring tools are designed for all users, from developers to business users, with tailored solutions for varied technical skill sets. With the ability to scale on demand, AWS Glue helps you focus on high-value activities that maximize the value of your data. It scales for any data size, and supports all data types and schema variances. To increase agility and optimize costs, AWS Glue provides built-in high availability and pay-as-you-go billing. AWS Glue consolidates major data integration capabilities into a single service, including data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It is also serverless, which means there is no infrastructure to manage. With support for ETL, ELT, and streaming workloads in one service, AWS Glue serves many kinds of users and workloads.
