© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rahul Bhartia, Principal Solutions Architect, APN
Andy Kimbrough, Sr. Mgr. of Engineering, Amazon S3
Paul Scott-Murphy, VP of Product Mgmt., WANdisco
Extending Hadoop and Spark to the AWS Cloud
with AWS Technology Partners
What to Expect from the Session
• Learn how to easily and seamlessly transition or extend Hadoop and Spark in AWS
• Patterns for migrating data from Hadoop clusters to Amazon S3
• Learn about solutions offered by AWS Big Data Competency Partners
• Automated deployment of Partner solutions on the AWS Cloud for minimal disruption
Big Data workloads in the Cloud
Reduce Costs
• Optimize infrastructure for the workload
• Decouple compute from storage
Increase Speed
• Launch resources as needed without any planning
• Provide self-service for users
Innovations
• Test new ideas and new frameworks without any commitment
• Bring new products to market
GE Oil & Gas is migrating 500 applications, and more than 750TB of data, to
the cloud by the end of 2016 as part of a major digital transformation, helping it
attain a 52% reduction in TCO and greater speed to market.
Amazon S3 – Storage for Big Data
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance
• Multipart upload
• Range GET
• Parallelize List
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon Elastic MapReduce (EMR)
• Amazon Redshift
• Amazon DynamoDB
• Spark, Hive, Impala, Presto
• Many others
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event Notification
• Lifecycle policies
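The "Multipart upload" and "Range GET" bullets both rely on splitting one object into byte ranges that can be moved in parallel. A minimal sketch of that splitting step, assuming hypothetical helper names (split_byte_ranges, range_header) that are not part of any AWS SDK; the resulting strings follow the standard HTTP Range header form that a ranged GET request accepts. For multipart upload, note that every part except the last must be at least 5 MiB.

```python
def split_byte_ranges(object_size, part_size):
    """Return inclusive (start, end) byte offsets that cover object_size bytes.

    Hypothetical helper for illustration; it performs no network calls.
    """
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        # The last range may be shorter than part_size.
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(start, end):
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # A 20 MiB object fetched in 8 MiB ranges (sizes chosen for illustration).
    for start, end in split_byte_ranges(20 * 1024 * 1024, 8 * 1024 * 1024):
        print(range_header(start, end))
```

Each range could then be handed to a separate worker thread or task, which is where the parallel speed-up would come from.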
Amazon S3 – Storage Management
• Cross-Region Replication
• Lifecycle Policy
• Data Classification & Management
• Event Notifications
• CloudWatch Metrics
• S3 Inventory
• Audit with CloudTrail Data Events
• Storage Analytics
Amazon S3 with Big Data workloads
• EMRFS with Amazon EMR
• Open source Hadoop/Spark connector (S3A)
  • Consistency - S3Guard
  • Performance - Lazy Seek, Connection re-use
  • AWS SDK – Multipart
• Other open-source integrations
  • HUE, Alluxio, Presto
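The S3A performance points above (lazy seek, connection re-use, multipart) map onto Hadoop configuration properties. A rough sketch, assuming illustrative values and a hypothetical helper (spark_conf_args) that renders them as spark-submit arguments; which properties are available depends on the Hadoop version in use, so treat the property set as an assumption to check against your distribution.

```python
# Illustrative S3A settings; the values are assumptions, not recommendations.
S3A_TUNING = {
    # Input policy hint that affects how seeks are handled on reads.
    "fs.s3a.experimental.input.fadvise": "random",
    # Upper bound on pooled HTTP connections to S3.
    "fs.s3a.connection.maximum": "100",
    # Part size in bytes for multipart uploads (128 MiB here).
    "fs.s3a.multipart.size": "134217728",
}


def spark_conf_args(hadoop_props):
    """Render Hadoop properties as spark-submit --conf arguments.

    Spark forwards any 'spark.hadoop.*' setting to the Hadoop configuration.
    """
    args = []
    for key in sorted(hadoop_props):
        args += ["--conf", f"spark.hadoop.{key}={hadoop_props[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + spark_conf_args(S3A_TUNING)))
```

The same key/value pairs could instead go into core-site.xml on the cluster; passing them per job keeps the experiment local to one application.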
Migrating Big Data workloads to the Cloud
[Diagram: two migration patterns, Lift-and-Shift and Burst-or-Extend; labels: HDFS, Application, Input, Output, Backup, Copy]
Getting your data to Cloud (and back …)
One Time, Periodic, Continuous
Patterns for migrating data from Hadoop
• AWS Snowball with HDFS Interface
• AWS Import/Export
• Amazon EMR with s3-dist-cp
• AWS S3 APIs
• AWS Technology Partners
• Amazon Kinesis with Streams and Firehose
• AWS DMS
AWS Snowball with HDFS Interface! (NEW)
$ snowball cp -n hdfs://HOST:PORT/PATH_TO_FILE_ON_HDFS s3://BUCKET-NAME/DESTINATION-PATH
Distributed Copy
Works like a MapReduce job with Amazon S3 as a target
Best for periodic data backups
s3DistCp --src s3://mybucket/prefix --dest hdfs:///folders --srcPattern pattern
[Diagram: On-prem Cluster, Amazon S3 bucket]
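For periodic backups, the copy is typically submitted as a step on an Amazon EMR cluster. A minimal sketch that builds such a step definition, assuming a hypothetical helper name (s3distcp_step) and placeholder paths; it only constructs the dictionary and makes no AWS calls. On EMR release 4.x and later the tool is invoked as s3-dist-cp through command-runner.jar.

```python
def s3distcp_step(src, dest, src_pattern=None, name="s3-dist-cp copy"):
    """Build an EMR step definition that runs s3-dist-cp.

    Hypothetical helper for illustration; src and dest are HDFS or S3 URIs.
    """
    args = ["s3-dist-cp", "--src", src, "--dest", dest]
    if src_pattern is not None:
        # Optional regular expression selecting which source files to copy.
        args += ["--srcPattern", src_pattern]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


if __name__ == "__main__":
    # Placeholder paths and pattern, chosen for illustration only.
    step = s3distcp_step("hdfs:///data/logs", "s3://mybucket/backup", r".*\.gz")
    print(step["HadoopJarStep"]["Args"])
```

A dictionary of this shape could be passed in the Steps list of the boto3 EMR client's add_job_flow_steps call, with the cluster ID supplied separately.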
Distributed Copy
• Workflow management - Apache Falcon
• Connectivity is the key - AWS Direct Connect
• Remember
  • Kerberos authentication must be handled across clusters
  • Needs scheduled workflow management
  • Can easily saturate the bandwidth
  • Needs compute for moving data
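To see why a copy job can saturate the link, it helps to estimate how long a dataset takes to move at a given bandwidth. A back-of-the-envelope sketch, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a hypothetical helper name (transfer_days); the 750 TB figure reuses the data volume quoted earlier purely as an illustration, and real throughput will vary.

```python
def transfer_days(data_tb, link_gbps, utilization=1.0):
    """Estimate days needed to move data_tb terabytes over a link_gbps link.

    utilization is the fraction of the link the copy is allowed to use.
    """
    if data_tb < 0 or link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("invalid size, bandwidth, or utilization")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86400


if __name__ == "__main__":
    # 750 TB over a fully used 1 Gbps link: roughly 69 days.
    print(round(transfer_days(750, 1), 1))
    # The same data over a 10 Gbps link at 80% utilization: roughly 9 days.
    print(round(transfer_days(750, 10, 0.8), 1))
```

Estimates like these suggest when a one-time bulk transfer device is the more practical choice than a network copy.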
WANdisco Fusion for Hadoop
Best for synchronization of data
DEMO
Replication to Amazon S3 with WANdisco
Paul Scott-Murphy
VP of Product Management
WANdisco
WANdisco Fusion for Hadoop
Advantages
• No extra compute on your cluster
• No management or workflow required
• Available via AWS Marketplace
Learn more at https://www.wandisco.com/product/amazon-s3-active-migrator
Stop by booth #2524
Data beyond Hadoop
Collect Logs/Events/Streams
Replicate Relational Databases
Get up and running in AWS
Moving to AWS is more than just “lift and shift”
Big Data Solution running on a NON-AWS Cloud environment
• Rigid / inflexible
• Low utilization
• High cost
Big Data Solution running on AWS
• Flexible – scale up / down in minutes
• Size to your needs in less than 1 hour
• Constantly optimize cost: price reductions + innovations
Step 1: Migrate Data
Step 2: Process
Step 3: Leverage
Get started easily
AWS Big Data Competency Partners
Hortonworks Data Cloud, AWS Marketplace, AWS Quick Start
Support for usage-based pricing model with data in Amazon S3
Complement the functionality with managed services or clusters
Symantec: Provisioning Big Data Platform
http://www.slideshare.net/HadoopSummit/provisioning-big-data-platform-using-cloudbreak-ambari
Leverage Amazon S3 with what you prefer
• Hive/LLAP with Amazon S3 - http://hortonworks.com/blog/llap-enables-sub-second-sql-hadoop/
• Impala with Amazon S3 - https://www.cloudera.com/documentation/enterprise/latest/topics/impala_s3.html
• Drill with Amazon S3 - https://www.mapr.com/resources/videos/sql-queries-data-amazon-s3-storage-drill-demo
• Databricks FileSystem - https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/10%20Databricks%20File%20System%20-%20DBFS.html
• Vertica External Flex Tables - https://community.dev.hpe.com/t5/Vertica-Blog/Automatic-HP-Vertica-Database-Loader-for-AWS-S3/ba-p/230344
Databricks FileSystem
DBFS is a distributed file system that comes installed on
Spark Clusters in Databricks. It is a layer over S3, which
allows you to:
• Mount S3 buckets to make them available to users in
your workspace
• Cache S3 data on the solid-state disks (SSDs) of your
worker nodes to speed up access.
HomeAway
HomeAway replaced its homegrown environment with Databricks to simplify
the management of their Spark infrastructure through its native access to S3,
interactive notebooks, and cluster management capabilities. With
Databricks, the productivity of their data science team increased dramatically,
allowing them to spend more time on rapid prototyping and asking more
questions of their data.
Thank you!
Remember to complete
your evaluations!

AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Rahul Bhartia, Principal Solutions Architect, APN Andy Kimbrough, Sr. Mgr. of Engineering, Amazon S3 Paul Scott-Murphy, VP of Product Mgmt., WANdisco Extending Hadoop and Spark to the AWS Cloud with AWS Technology Partners
  • 2. What to Expect from the Session • Learn how to easily and seamlessly transition or extend Hadoop and Spark in AWS • Patterns for migrating data from Hadoop clusters to Amazon S3 • Learn about solutions offered by AWS Big Data Technology Competency Partners • Automated deployment of Partner solutions on the AWS Cloud for minimal disruption
  • 3. Reduce Costs Increase Speed Innovations Big Data workloads in the Cloud • Optimize infrastructure for the workload • Decouple compute from storage GE Oil & Gas is migrating 500 applications, and more than 750TB of data, to the cloud by the end of 2016 as part of a major digital transformation, helping it attain a 52% reduction in TCO and greater speed to market. • Launch resources as needed without any planning • Provide self-service for users • Test new ideas and new frameworks without any commitment • Bring new products to market
  • 4. Amazon S3 – Storage for Big data • Durable: designed for 11 9s of durability • Available: designed for 99.99% availability • High performance: multipart upload, Range GET, parallelized List • Scalable: store as much as you need, scale storage and compute independently, no minimum usage commitments • Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB, Spark, Hive, Impala, Presto, and many others • Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notifications, lifecycle policies
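
A minimal sketch of the arithmetic behind the Range GET and multipart upload items on slide 4: both split one object into byte ranges that separate workers can fetch or upload in parallel. The function names and the part size are illustrative assumptions, not part of any AWS SDK; only the `bytes=start-end` form of the HTTP `Range` header is standard.

```python
from typing import List, Tuple


def byte_ranges(object_size: int, part_size: int) -> List[Tuple[int, int]]:
    """Split an object into inclusive (start, end) byte ranges.

    Both names are hypothetical helpers for illustration only.
    """
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    return [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # A 10-byte object read in 4-byte parts; the last part is shorter.
    for start, end in byte_ranges(10, 4):
        print(range_header(start, end))
```

Each header value could then be sent on its own GET request, one per worker, and the results reassembled in range order.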
  • 5. Cross-Region Replication Lifecycle Policy Data Classification & Management Event Notifications CloudWatch Metrics S3 Inventory Audit with CloudTrail Data Events Storage Analytics S3 – Storage Management for S3
  • 6. Amazon S3 with Big Data workloads • EMRFS with Amazon EMR • Open source Hadoop/Spark connector (S3A) • Consistency - S3Guard • Performance - Lazy Seek, Connection re-use • AWS SDK – Multipart • Other open-source integrations • HUE, Alluxio, Presto
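
As a hedged illustration of the S3A items on slide 6, the sketch below turns a few S3A connector properties into `spark-submit` arguments. The property names are Hadoop S3A settings passed through Spark's `spark.hadoop.` prefix; the values are placeholders chosen for illustration, the helper name is hypothetical, and which properties are honored depends on the Hadoop version in use.

```python
from typing import Dict, List

# Illustrative values only (assumptions); tune them for the actual workload.
S3A_SETTINGS: Dict[str, str] = {
    # Size of the HTTP connection pool shared by S3A streams.
    "fs.s3a.connection.maximum": "100",
    # Part size in bytes used for multipart uploads.
    "fs.s3a.multipart.size": "104857600",
    # Input policy hint for seek-heavy reads.
    "fs.s3a.experimental.input.fadvise": "random",
}


def to_spark_submit_args(hadoop_props: Dict[str, str]) -> List[str]:
    """Render Hadoop properties as --conf arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(hadoop_props):
        args += ["--conf", f"spark.hadoop.{key}={hadoop_props[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(S3A_SETTINGS)))
```

Keeping the settings in one dictionary makes it easy to review them alongside the job definition before they reach the cluster.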
  • 7. Migrating Big Data workloads to the Cloud HDFS Application HDFS Application Input Output Backup Input Output Copy Application Lift-and-Shift Burst-or-Extend
  • 8. Getting your data to Cloud (and back …)
  • 9. One Time Periodic Continuous Patterns for migrating data from Hadoop • AWS Snowball with HDFS Interface • AWS Import/Export • Amazon EMR with s3-dist-cp • AWS S3 APIs AWS Technology Partners • Amazon Kinesis with Streams and Firehose • AWS DMS
  • 10. AWS Snowball with HDFS Interface! (NEW) $ snowball cp -n hdfs://HOST:PORT/PATH_TO_FILE_ON_HDFS s3://BUCKET-NAME/DESTINATION-PATH
  • 11. Distributed Copy • Works like a MapReduce job with Amazon S3 as a target • Best for periodic data backups • s3DistCp --src s3://mybucket/prefix --dest hdfs:///folders --srcPattern pattern (Diagram labels: On-prem Cluster, Amazon S3 bucket)
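
One way a scheduled job might assemble the distributed copy command from slide 11 is sketched below. Only the `--src`, `--dest`, and `--srcPattern` flags shown on the slide are used; the helper name, the scheme check, and the default executable name `s3-dist-cp` (the form used on slide 9) are assumptions made for illustration.

```python
import shlex
from typing import List, Optional
from urllib.parse import urlparse

# Schemes accepted by this sketch (an assumption, not a tool restriction).
ALLOWED_SCHEMES = {"hdfs", "s3"}


def build_distcp_argv(
    src: str,
    dest: str,
    src_pattern: Optional[str] = None,
    executable: str = "s3-dist-cp",
) -> List[str]:
    """Build the argument list for a distributed copy between HDFS and S3."""
    for uri in (src, dest):
        if urlparse(uri).scheme not in ALLOWED_SCHEMES:
            raise ValueError(f"unsupported URI: {uri!r}")
    argv = [executable, "--src", src, "--dest", dest]
    if src_pattern is not None:
        # Regular expression that selects which source files are copied.
        argv += ["--srcPattern", src_pattern]
    return argv


if __name__ == "__main__":
    argv = build_distcp_argv("hdfs:///folders", "s3://mybucket/prefix", r".*\.log")
    # Quote each argument so the printed line is safe to paste into a shell.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Returning a list rather than a single string lets the caller hand it to a process launcher without a shell, which avoids quoting problems in the pattern.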
  • 12. Distributed Copy • Workflow management - Apache Falcon • Connectivity is the key - AWS Direct Connect • Remember • Kerberos authentication across clusters must be handled • Needs scheduled workflow management • Can easily saturate the bandwidth • Needs compute for moving data
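
To put the bandwidth warning on slide 12 into numbers, here is a rough estimate of how long a bulk copy occupies a network link. It is a back-of-the-envelope sketch: the function names are hypothetical, terabytes are taken as decimal (10^12 bytes), and the `utilization` factor standing in for protocol overhead and shared traffic is an assumption.

```python
def transfer_seconds(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Seconds needed to move data_tb terabytes over a link_gbps link."""
    if data_tb < 0 or link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("invalid size, link speed, or utilization")
    bits = data_tb * 1e12 * 8  # decimal terabytes to bits
    return bits / (link_gbps * 1e9 * utilization)


def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Same estimate expressed in days."""
    return transfer_seconds(data_tb, link_gbps, utilization) / 86400


if __name__ == "__main__":
    # 750 TB is the data volume quoted on slide 3.
    for gbps in (1.0, 10.0):
        print(f"{gbps:g} Gbps: {transfer_days(750, gbps):.1f} days")
```

At full utilization, 750 TB takes roughly 69 days on a 1 Gbps link and about a week on 10 Gbps, which is why the copy can crowd out other traffic on the same connection.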
  • 13. WANdisco Fusion for Hadoop Best for synchronization of data
  • 14. DEMO Replication to Amazon S3 with WANdisco Paul Scott-Murphy VP of Product Management WANdisco
  • 15. WANdisco Fusion for Hadoop Advantages • No extra compute on your cluster • No management or workflow required • Available via AWS Marketplace Learn more at https://www.wandisco.com/product/amazon-s3-active-migrator Stop by booth #2524
  • 16. Data beyond Hadoop Collect Logs/Events/Streams Replicate Relational Databases
  • 17. Get up and running in AWS
  • 18. Moving to AWS is more than just “lift and shift” Big Data Solution running on a NON-AWS Cloud environment • Rigid / inflexible • Low utilization • High cost Big Data Solution running on AWS • Flexible – scale up / down in minutes • Size to your needs in less than 1 hour • Constantly optimize cost: price reductions + innovations Step 1: Migrate Data Step 2: Process Step 3: Leverage
  • 20. AWS Big Data Competency Partners Hortonworks Data Cloud AWS Marketplace AWS Quick Start Support for usage-based pricing model with data in Amazon S3 Complement the functionality with managed services or clusters
  • 21. Symantec: Provisioning Big Data Platform http://www.slideshare.net/HadoopSummit/provisioning-big-data-platform-using-cloudbreak-ambari
  • 22. Leverage Amazon S3 with what you prefer • Hive/LLAP with Amazon S3 - http://hortonworks.com/blog/llap-enables-sub-second-sql-hadoop/ • Impala with Amazon S3 - https://www.cloudera.com/documentation/enterprise/latest/topics/impala_s3.html • Drill with Amazon S3 - https://www.mapr.com/resources/videos/sql-queries-data-amazon-s3-storage-drill-demo • Databricks FileSystem - https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/10%20Databricks%20File%20System%20-%20DBFS.html • Vertica External Flex Tables - https://community.dev.hpe.com/t5/Vertica-Blog/Automatic-HP-Vertica-Database-Loader-for-AWS-S3/ba-p/230344
  • 23. Databricks FileSystem DBFS is a distributed file system that comes installed on Spark Clusters in Databricks. It is a layer over S3, which allows you to: • Mount S3 buckets to make them available to users in your workspace • Cache S3 data on the solid-state disks (SSDs) of your worker nodes to speed up access.
  • 24. HomeAway HomeAway replaced its homegrown environment with Databricks to simplify the management of its Spark infrastructure through native access to S3, interactive notebooks, and cluster management capabilities. With Databricks, the productivity of its data science team increased dramatically, allowing the team to spend more time on rapid prototyping and asking more questions of its data.