scale to infinity
Big Data constraints
strengths or limitations
Hadoop ecosystem
real-time analytics
Big Data partner solutions
workflow automation
Building the Square Kilometer Array (SKA), the biggest radio telescope
SKA will process as much data every day as the world currently produces in a year 
Using AWS and crowd-sourced CPUs to analyze 400-500 galaxies simultaneously
Mobile / Cable Telecom 
Oil & Gas Industrial Manufacturing 
Retail/Consumer Entertainment Hospitality 
Life Sciences Scientific Exploration 
Financial Services 
Publishing Media Advertising 
Online Media Social Network Gaming
Unstructured data growth is explosive, with an estimated compound annual growth rate (CAGR) of 62%
Source: IDC 
[Chart: data volume, 1990 to 2020, showing the gap between generated data and data available for analysis]
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Remove Constraints: 100 instances x 1 hour
Remove Constraints: 1000 instances x 1 hour
Remove Constraints
No upfront capital + on-demand services + elastic and scalable + pay for what you use = AWS removes constraints
Big Data Constraints 
•Volume: massive datasets 
•Variety: requiring new tools 
•Velocity: iterative, experimental data manipulation and analysis 
•Time to results: more critical than absolute performance 
AWS Cloud Computing 
•Virtually unlimited resources 
•Variety of compute solutions 
•Iterative, experimental usage/deployment of infrastructure
•Get faster results with effective parallel autonomous projects
One tool to rule them all
Foundation Services 
Storage 
(Object, Block and Archive) 
Networking 
Security & Access Control 
Compute 
(VMs, Auto-scaling and Load Balancing) 
Infrastructure 
Regions 
Availability Zones 
CDN and Points of Presence 
Platform Services 
Databases 
Relational 
NoSQL 
Caching 
Analytics 
Hadoop 
Real-time 
Data warehouse 
App Services 
Queuing 
Orchestration 
App streaming 
Transcoding 
Email 
Search 
Deployment & Management 
Containers 
Dev/ops Tools 
Resource Templates 
Mobile Services 
Identity 
Sync 
Mobile Analytics 
Notifications 
Data Workflows 
Usage Tracking 
Monitoring and Logs 
Enterprise 
Applications 
Virtual Desktops 
Collaboration and Sharing
HDFS, YARN, MapReduce
Courtesy: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
EMR Cluster 
S3 
Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. 
Getting Started: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-overview.html 
Put the data into Amazon S3 
Launch the cluster using the console, CLI, SDK, or API 
You can easily add and remove nodes 
You can also store everything in HDFS 
Get the output from Amazon S3
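The launch step above can also be scripted. The following is a minimal sketch in Python that only assembles the request for the EMR RunJobFlow API, so it can be inspected without AWS credentials. The bucket name, instance type, release label, role names, and the streaming step are placeholders chosen for illustration, not values from the talk.

```python
def build_emr_request(bucket, node_count=3, instance_type="m5.xlarge"):
    """Assemble keyword arguments for an EMR RunJobFlow call (nothing is sent)."""
    return {
        "Name": "example-cluster",             # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",          # placeholder release; pick your own
        "LogUri": f"s3://{bucket}/logs/",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,
            # Terminate the cluster once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "streaming-example",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Input is read from Amazon S3 and output is written back to it.
                "Args": ["hadoop-streaming",
                         "-input", f"s3://{bucket}/input/",
                         "-output", f"s3://{bucket}/output/",
                         "-mapper", "cat",
                         "-reducer", "wc"],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_emr_request("my-example-bucket", node_count=5)
```

With the boto3 SDK installed and credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`; that call is left out here so the sketch runs offline.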
References:
http://aws.amazon.com/elasticmapreduce/getting-started/
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html
Hadoop 2.4.0
Hive, Pig, HBase, Impala, Ganglia
encryption
Consistent view on every cluster node
Basic statistics are suitable for Hadoop 
Some other Big Data problems are not suitable for Hadoop:
•dependencies
•splits are interrelated
•access data across splits
•iterative computations
Courtesy: http://www.amazon.com/Big-Data-Analytics-Beyond-Hadoop/dp/0133837947/
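To illustrate why basic statistics fit the MapReduce model: each split can be summarized on its own, and the partial summaries can be merged in any order. The following is a minimal sketch in plain Python, not Hadoop code; the function names and sample data are made up for the example.

```python
from functools import reduce

def map_split(values):
    """Mapper: summarize one split as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(a, b):
    """Reducer: merge two partial summaries. Order does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(splits):
    """Compute the mean and population variance from independent splits."""
    n, total, total_sq = reduce(combine, map(map_split, splits))
    mean = total / n
    return mean, total_sq / n - mean * mean

# Three splits of different sizes, as they might land on three mappers.
splits = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(splits)
```

No mapper needs data from another split here. An iterative computation, by contrast, needs the result of one pass as input to the next, which on MapReduce generally means one full job per iteration.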
•Accumulo – cell-based access control NoSQL
•Avro – data serialization system
•Cascading – alternative language APIs on MR
•Cassandra – multi-master NoSQL DB
•Chukwa – data collection system at scale
•Flume – collecting, aggregating, moving logs
•Giraph – iterative graph processing system
•HBase – large table NoSQL DB
•HDFS – distributed file system
•Hive – SQL on MapReduce, data warehouse
•Mahout – scalable machine learning library
•MapReduce – parallel processing on YARN
•Nutch – web crawler software
•Pig – high-level scripting on MapReduce
•R – statistical computing and graphics
•Spark – general compute engine on YARN
•Sqoop – transferring data to/from RDBMS
•Tez – data-flow programming on YARN
•Thrift – build scalable cross-language services
•ZooKeeper – high-performance coordination
Courtesy: http://www.apache.org/
scripting
statistical analysis
mixture of paradigms
single-machine, single-thread
Hadoop offers a path to scale R computation to distributed systems 
Courtesy: http://www.r-project.org/ 
http://www.amazon.com/Learning-R-Richard-Cotton/dp/1449357105/
R on every node
Hadoop Streaming
Revolution Analytics RHadoop
rmr: mapreduce()
rhdfs
rhbase
RStudio
References:
http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UseCase_Streaming.html
scalable machine learning library 
Collaborative filtering (recommender engines) e.g. for movies, books, etc. based on comparing user preferences 
Clustering (unsupervised learning) e.g. identify groupings of related news stories based on input data properties 
Classification (supervised learning or predictive analytics) e.g. spam filtering based on training spam data
Courtesy: http://mahout.apache.org
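As a rough illustration of the first technique, collaborative filtering by comparing user preferences, here is a minimal user-based recommender in plain Python. It is not Mahout's API; the ratings, the user and item names, and the choice of cosine similarity are assumptions made for the example.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "bob":   {"m1": 4.0, "m2": 3.0, "m3": 5.0, "m4": 5.0},
    "carol": {"m1": 1.0, "m2": 5.0, "m4": 1.0},
}

def cosine(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], prefs)
        for item, rating in prefs.items():
            if item in ratings[user]:
                continue  # only recommend items the user has not seen
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = [(scores[i] / weights[i], i) for i in scores if weights[i] > 0]
    return sorted(ranked, reverse=True)

suggestions = recommend("alice", ratings)
```

The predicted rating for an unseen item is pulled toward the ratings of the most similar users, so here bob's opinion of `m4` counts for more than carol's.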
Mahout
References:
http://mahout.apache.org/users/classification/twenty-newsgroups.html
http://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html
Developed by Yahoo!, based on Google Pregel (PageRank)
Customized by Facebook to scale on the full friendship graph (~1B vertices and ~ 100B edges) 
Single-vertex-centric API 
Bulk Synchronous Parallel machine 
ZooKeeper-enforced atomic barrier
Iterations performed in memory 
Runs in mappers, or native YARN 
Courtesy: http://giraph.apache.org 
http://giraph.apache.org/pagerank.html 
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920 
Single Source Shortest Path Example 
values sent as messages (blue) 
BSP superstep
vertex updates (red)
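The single-source shortest path example can be sketched as a vertex-centric, bulk-synchronous loop. This is a single-process Python simulation of the model, not Giraph's Java API; the graph and the function name are illustrative.

```python
import math

def sssp_bsp(edges, source):
    """Single-source shortest paths in BSP style.

    edges maps every vertex to a list of (neighbor, weight) pairs.
    """
    dist = {vertex: math.inf for vertex in edges}
    inbox = {source: [0.0]}          # messages waiting for the next superstep
    superstep = 0
    while inbox:                     # halt when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best  # vertex update
                for neighbor, weight in edges[vertex]:
                    # value sent as a message to each neighbor
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox               # barrier: delivery happens next superstep
        superstep += 1
    return dist, superstep

graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0), ("d", 6.0)],
    "c": [("d", 1.0)],
    "d": [],
}
distances, supersteps = sssp_bsp(graph, "a")
```

Each vertex only reads its own messages and writes to its neighbors' inboxes, which is what lets a real BSP engine run the inner loop on many workers and synchronize only at the barrier.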
configure-hadoop
Apache ZooKeeper
Giraph: http://git-wip-us.apache.org/repos/asf/giraph.git
Maven 3
JAR file
Giraph jar
References:
http://giraph.apache.org/apidocs/org/apache/giraph/examples/SimplePageRankComputation.html
http://giraph.apache.org/quick_start.html
http://giraph.apache.org/build.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
Amazon Redshift
Getting Started: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
Leader node (SQL clients, BI tools access)
•PostgreSQL endpoint
•Stores metadata
•Coordinates queries
Compute nodes
•Local, columnar storage
•Execute queries in parallel
•Amazon S3 load, backup/restore
•Integration with Amazon DynamoDB, EMR, Kinesis
JDBC/ODBC
•Connect using drivers from PostgreSQL.org
[Diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node; the leader node and three compute nodes (each labeled 128GB RAM, 16TB disk, 16 cores) are linked by 10 GigE (HPC); ingestion, backup, and restore go through Amazon S3]
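Loading data from Amazon S3 into Amazon Redshift is done with the COPY command, issued through the leader node's PostgreSQL endpoint. The following is a minimal sketch that only assembles the statement. The table name, S3 path, and IAM role ARN are placeholders, and running the statement would need a PostgreSQL-compatible driver and a live cluster.

```python
def build_copy_statement(table, s3_path, iam_role_arn, delimiter="|"):
    """Build a Redshift COPY statement for gzip-compressed, delimited files.

    The arguments are interpolated directly into the SQL text, so they
    must come from trusted configuration, never from user input.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"DELIMITER '{delimiter}' GZIP;"
    )

# Placeholder values for illustration only.
statement = build_copy_statement(
    table="events",
    s3_path="s3://my-example-bucket/output/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

Pointing COPY at a prefix that holds many files lets the compute nodes load them in parallel, which matches the parallel execution described above.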
2 years old
all over the world
1900 products
200 allow BYOL
BI tools
MongoDB on AWS Architecture Whitepaper
Running MongoDB on Amazon EC2
Can easily launch a multi-node replica set
Keep JSON templates in source control
https://mongodb-documentation.readthedocs.org/en/latest/ecosystem/tutorial/automate-deployment-with-cloudformation.html
AWS CloudFormation JSON Templates
AMI in AWS Marketplace
No extra cost
Running Cloudera EDH on Amazon EC2
Cloudera on AWS Product Brief
Cloudera Enterprise Data Hub on AWS
Deploy via Cloudera Director
Manage via Cloudera Manager
http://aws.amazon.com/about-aws/whats-new/2014/10/15/clouderas-enterprise-data-hub-edh-on-aws-quick-start/
AWS CloudFormation JSON Templates
Cloudera Enterprise Reference Architecture on AWS
[Diagram: data movement between the corporate data center and the AWS Cloud. Labels shown: VPN Connection, AWS Direct Connect, S3 Multipart Upload, AWS Import/Export, remote loading using SSH, logs/files, source DBs, Amazon EC2 or on-premises, Amazon S3, Amazon RDS, Amazon Glacier, Amazon Kinesis, Amazon DynamoDB, Amazon Redshift, Amazon Elastic MapReduce]
[Architecture diagram spanning the corporate data center and the AWS cloud. Labels shown: DB, data warehouse extracts, social media, Gnip Data Collector, log files and unstructured data, Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR/Spark/R/Mahout/Giraph, Hive, Sqoop, Hive/Shark - ODBC/JDBC, Amazon Redshift, RS COPY, PostgreSQL/ODBC/JDBC, AWS Data Pipeline, Amazon SWF, visualization and analysis (Tableau, Jaspersoft, etc.) in both the data center and the cloud, presentation tools]
[Architecture diagram, partner-solution variant, spanning the corporate data center and the AWS cloud. Labels shown: DB, data warehouse extracts, social media, Gnip Data Collector, log files and unstructured data, Amazon S3, Amazon Kinesis, Cloudera EDH on Amazon EC2/Spark/R/Mahout/Giraph, Sqoop, MongoDB on AWS, Hive/Shark - ODBC/JDBC, Amazon Redshift, PostgreSQL/ODBC/JDBC, ODBC/JDBC, AWS Data Pipeline, Amazon SWF, visualization and analysis (Tableau, Jaspersoft, etc.), presentation tools]
[Architecture diagram, MongoDB variant, spanning the corporate data center and the AWS cloud. Labels shown: DB, data warehouse extracts, social media, Gnip Data Collector, log files and unstructured data, Amazon S3, Amazon Kinesis, Amazon EMR/Spark/R/Mahout/Giraph, Sqoop, MongoDB on AWS, JSON, Hive/Shark - ODBC/JDBC, PostgreSQL/ODBC/JDBC, AWS Data Pipeline, Amazon SWF, presentation tools in both the data center and the cloud]
exponentially
removes Big Data constraints (three ‘v’)
Hadoop in the cloud
real-time data analytics
agile Big Data platform
ICAO Headquarters 
ICAO Regional Office
[Diagram labels: cloud, cloud, in-house, in-house, synced, cloud]
Data, Basic UI: Create, Read, Update, Delete
Data, Fancy UI: Read, Metrics
Collect 
Map 
Reduce 
Publish 
Key 
Priority
tr -d "\n" | tr -d "\r" |
sed "s#<Accident>#\n<Accident>#g"
Amazon S3, Amazon EC2
Use Linux crontab to schedule
Make one XML element per line for Amazon EMR
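The same reshaping can be sketched in Python, assuming the input is an XML export in which each record is an `<Accident>` element, as in the pipeline above. The function name and the sample data are illustrative.

```python
def one_element_per_line(xml_text, tag="<Accident>"):
    """Remove all line breaks, then start a new line before each record tag.

    The result has one record element per line, so a line-oriented
    Hadoop input format hands each mapper whole records.
    """
    flat = xml_text.replace("\n", "").replace("\r", "")
    return flat.replace(tag, "\n" + tag)

# Hypothetical input with mixed line endings and a record split over lines.
sample = (
    "<Accidents>\r\n"
    "<Accident>\n  <Id>1</Id>\n</Accident>\n"
    "<Accident><Id>2</Id></Accident>\n"
    "</Accidents>"
)
result = one_element_per_line(sample)
```

Like the shell version, this is plain text substitution rather than XML parsing, so it relies on the opening tag appearing exactly as written, with no attributes.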
--put /home/ec2-user/key/newtest.pem 
--to /home/hadoop 
Copy the SSH key to hadoop if you need to remote sh
s3://elasticmapreduce/libs/script-runner/script-runner.jar 
Move the results from Amazon S3 to somewhere else
Amazon Elastic MapReduce, Amazon S3
treat
Amazon Elastic MapReduce, Amazon S3
Amazon EC2, Amazon S3, Amazon EMR
Learn from AWS big data experts 
blogs.aws.amazon.com/bigdata 
BDT205: Your First Big Data Application on AWS 
BDT403: Netflix’s Next Generation Big Data Platform 
BDT305: Lessons Learned and Best Practices for Running Hadoop on AWS
Please give us your feedback on this session. 
Complete session evaluations and earn re:Invent swag. 
http://bit.ly/awsevals
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift | AWS re:Invent 2014

Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of AmazonBig Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon
Data Con LA
 
AWS Big Data Landscape
AWS Big Data LandscapeAWS Big Data Landscape
AWS Big Data Landscape
Crishantha Nanayakkara
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...
Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...
Astroinformatics 2014: Scientific Computing on the Cloud with Amazon Web Serv...
Jamie Kinney
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Precisely
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Concevoir une application scalable dans le Cloud
Concevoir une application scalable dans le CloudConcevoir une application scalable dans le Cloud
Concevoir une application scalable dans le Cloud
Stéphanie Hertrich
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
Amazon Web Services
 
Picking the right AWS backend for your Java application
Picking the right AWS backend for your Java applicationPicking the right AWS backend for your Java application
Picking the right AWS backend for your Java application
Julien SIMON
 
Picking the right AWS backend for your Java application (May 2017)
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift | AWS re:Invent 2014

  • 1.
  • 2. scale to infinity; Big Data constraints; strengths or limitations; Hadoop ecosystem; real-time analytics; Big Data partner solutions; workflow automation
  • 3.
  • 4.
  • 5. Building the Square Kilometer Array (SKA), the biggest radio telescope. SKA will process as much data every day as the world currently produces in a year. Using AWS and crowd-sourced CPUs to analyze 400-500 galaxies simultaneously.
  • 6. Mobile / Cable Telecom; Oil & Gas, Industrial Manufacturing; Retail/Consumer, Entertainment, Hospitality; Life Sciences, Scientific Exploration; Financial Services; Publishing, Media, Advertising; Online Media, Social Network, Gaming
  • 7. Unstructured data growth is explosive, with estimates of compound annual growth rate (CAGR) at 62% (source: IDC). [Chart: data volume, 1990 to 2020, showing the gap between generated data and data available for analysis.] Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares.
  • 8.
  • 9. Remove Constraints 100 instances x 1 hour
  • 10. Remove Constraints 1000 instances x 1 hour
  • 11. Remove Constraints: No upfront capital + On-demand services + Elastic and scalable + Pay for what you use = AWS removes constraints
  • 12. Big Data Constraints: • Volume: massive datasets • Variety: requiring new tools • Velocity: iterative, experimental data manipulation and analysis • Time to results: more critical than absolute performance. AWS Cloud Computing: • Virtually unlimited resources • Variety of compute solutions • Iterative, experimental usage/deployment of infrastructure • Get faster results with effective parallel autonomous projects
  • 13. One tool to rule them all
  • 14. Foundation Services: Storage (Object, Block and Archive); Networking; Security & Access Control; Compute (VMs, Auto-scaling and Load Balancing). Infrastructure: Regions; Availability Zones; CDN and Points of Presence. Platform Services: Databases (Relational, NoSQL, Caching); Analytics (Hadoop, Real-time, Data warehouse); App Services (Queuing, Orchestration, App streaming, Transcoding, Email, Search); Deployment & Management (Containers, Dev/ops Tools, Resource Templates); Mobile Services (Identity, Sync, Mobile Analytics, Notifications); Data Workflows; Usage Tracking; Monitoring and Logs. Enterprise Applications: Virtual Desktops; Collaboration and Sharing.
  • 15.
  • 17. EMR Cluster and S3. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. Getting Started: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-overview.html Put the data into Amazon S3. Launch the cluster using the console, CLI, SDK, or API. You can easily add and remove nodes. You can also store everything in HDFS. Get the output from Amazon S3.
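The launch step on slide 17 could be scripted roughly as follows. This is a minimal sketch that only builds the request parameters one might pass to boto3's `run_job_flow`; the helper name `build_emr_request`, the bucket name, the instance type, the release label, and the role names are placeholder assumptions and do not come from the talk.

```python
def build_emr_request(name, log_bucket, core_nodes, instance_type="m5.xlarge"):
    """Build keyword arguments for an EMR cluster launch (all values are placeholders)."""
    if core_nodes < 1:
        raise ValueError("need at least one core node")
    return {
        "Name": name,
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "ReleaseLabel": "emr-6.15.0",  # assumed release; use one available in your account
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Keep the cluster running so nodes can be added or removed later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("demo-cluster", "my-bucket", core_nodes=3)
print(request["Instances"]["InstanceGroups"][1]["InstanceCount"])
```

With boto3 installed and credentials configured, the dictionary would be passed as `boto3.client("emr").run_job_flow(**request)`.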
  • 19. Basic statistics are suitable for Hadoop. Some other Big Data problems are not suitable for Hadoop: dependencies; splits are interrelated; access data across splits; iterative computations. Courtesy: http://www.amazon.com/Big-Data-Analytics-Beyond-Hadoop/dp/0133837947/
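To illustrate why basic statistics fit the MapReduce model on slide 19, here is a small sketch in plain Python (not Hadoop code). Each split is summarized independently and the partial results are merged, so no split needs data from another. The function names are illustrative.

```python
from functools import reduce


def map_split(values):
    """Map step: summarize one split independently as (sum, count)."""
    return (sum(values), len(values))


def combine(a, b):
    """Reduce step: merge two partial summaries; the merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(splits):
    partials = [map_split(s) for s in splits]  # each call could run on a different node
    total, count = reduce(combine, partials, (0, 0))
    if count == 0:
        raise ValueError("no data")
    return total / count


splits = [[1, 2, 3], [4, 5], [6]]
print(distributed_mean(splits))  # 3.5
```

An iterative algorithm, by contrast, would have to repeat this whole map and reduce cycle once per iteration, which is the limitation the slide points to.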
  • 20.
  • 21. • Accumulo – cell-based access control NoSQL • Avro – data serialization system • Cascading – alternative language APIs on MR • Cassandra – multi-master NoSQL DB • Chukwa – data collection system at scale • Flume – collecting, aggregating, moving logs • Giraph – iterative graph processing system • HBase – large table NoSQL DB • HDFS – distributed file system • Hive – SQL on MapReduce (data warehouse) • Mahout – scalable machine learning library • MapReduce – parallel processing on YARN • Nutch – web crawler software • Pig – high-level scripting on MapReduce • R – statistical computing and graphics • Spark – general compute engine on YARN • Sqoop – transferring data to/from RDBMS • Tez – data-flow programming on YARN • Thrift – build scalable cross-language services • ZooKeeper – high-performance coordination. Courtesy: http://www.apache.org/
  • 23. R: scripting; statistical analysis; mixture of paradigms; single-machine, single-thread. Hadoop offers a path to scale R computation to distributed systems. Courtesy: http://www.r-project.org/ http://www.amazon.com/Learning-R-Richard-Cotton/dp/1449357105/
  • 24. R on every node; Hadoop Streaming; Revolution Analytics RHadoop: rmr mapreduce(), rhdfs, rhbase; RStudio. References: http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UseCase_Streaming.html
  • 25. Scalable machine learning library. Collaborative filtering (recommender engines), e.g. for movies, books, etc., based on comparing user preferences. Clustering (unsupervised learning), e.g. identify groupings of related news stories based on input data properties. Classification (supervised learning or predictive analytics), e.g. spam filtering based on training spam data. Courtesy: http://mahout.apache.org
  • 26. Mahout. References: http://mahout.apache.org/users/classification/twenty-newsgroups.html http://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html
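The collaborative filtering idea on slide 25 (recommend items by comparing user preferences) can be sketched in a few lines of plain Python. This is not Mahout's API; the cosine-similarity weighting, the function names, and the toy ratings are assumptions chosen for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=2):
    """Score items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


ratings = {
    "ann": {"Alien": 5, "Brazil": 4},
    "bob": {"Alien": 5, "Brazil": 4, "Casablanca": 5},
    "cat": {"Dune": 5},
}
print(recommend(ratings, "ann"))
```

Because "bob" rates the same films as "ann" in the same way, his extra title ranks first, while "cat" shares no items with "ann" and contributes a zero score.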
  • 27. Developed by Yahoo! based on Google Pregel (page rank). Customized by Facebook to scale on the full friendship graph (~1B vertices and ~100B edges). Single-vertex-centric API. Bulk Synchronous Parallel machine. ZooKeeper-enforced atomic barrier. Iterations performed in memory. Runs in mappers, or native YARN. Single Source Shortest Path example: values sent as messages (blue); BSP superstep vertex updates (red). Courtesy: http://giraph.apache.org http://giraph.apache.org/pagerank.html https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
  • 28. configure-hadoop; Apache ZooKeeper; Giraph: http://git-wip-us.apache.org/repos/asf/giraph.git; Maven 3; JAR file; Giraph jar. References: http://giraph.apache.org/apidocs/org/apache/giraph/examples/SimplePageRankComputation.html http://giraph.apache.org/quick_start.html http://giraph.apache.org/build.html http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
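The single-source shortest path example on slide 27 can be simulated on one machine to show how supersteps, messages, and vertex updates interact. This is a plain-Python sketch of the Bulk Synchronous Parallel pattern, not Giraph code. It assumes non-negative edge weights and that every vertex appears as a key in `edges`; the names are illustrative.

```python
import math


def bsp_shortest_paths(edges, source):
    """Vertex-centric single-source shortest paths computed in BSP supersteps.

    edges maps each vertex to a list of (neighbor, weight) pairs.
    """
    dist = {v: math.inf for v in edges}
    inbox = {source: [0]}              # messages to deliver in the next superstep
    supersteps = 0
    while inbox:                       # halt when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:    # vertex update
                dist[vertex] = best
                for neighbor, weight in edges[vertex]:
                    # value sent as a message to the neighbor
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox                 # barrier: messages become visible together
        supersteps += 1
    return dist, supersteps


graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": [("d", 1)], "d": []}
dist, steps = bsp_shortest_paths(graph, "a")
print(dist, steps)  # {'a': 0, 'b': 1, 'c': 3, 'd': 4} 4
```

In Giraph the barrier between supersteps is what ZooKeeper coordinates across workers; here it is just the assignment `inbox = outbox`.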
  • 29.
  • 30. Getting Started: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html Leader node (SQL clients, BI tools access) PostgreSQL endpoint Stores metadata Coordinates queries Ingestion Backup Restore Amazon S3 128GB RAM 16TB disk 16 cores Compute Node 128GB RAM 16TB disk 16 cores Compute Node 128GB RAM 16TB disk 16 cores Compute Node 10 GigE (HPC) SQL Clients/BI Tools 128GB RAM 16TB disk 16 cores JDBC/ODBC LeaderNode Compute nodes Local, columnar storage Execute queries in parallel Amazon S3 load, backup/restore Integration with Amazon DynamoDB, EMR, Kinesis
  • 31. Amazon Redshift JDBC/ODBC: connect using drivers from PostgreSQL.org
  • 32.
  • 33. 2 years old; all over the world; 1900 products; 200 allow BYOL; BI tools
  • 34. MongoDB on AWS Architecture Whitepaper. Running MongoDB on Amazon EC2. Can easily launch a multi-node replica set. Keep JSON templates in source control: https://mongodb-documentation.readthedocs.org/en/latest/ecosystem/tutorial/automate-deployment-with-cloudformation.html AWS CloudFormation JSON Templates. AMI in AWS Marketplace. No extra cost.
  • 35. Running Cloudera EDH on Amazon EC2. Cloudera on AWS Product Brief. Cloudera Enterprise Data Hub on AWS. Deploy via Cloudera Director. Manage via Cloudera Manager. http://aws.amazon.com/about-aws/whats-new/2014/10/15/clouderas-enterprise-data-hub-edh-on-aws-quick-start/ AWS CloudFormation JSON Templates. Cloudera Enterprise Reference Architecture on AWS.
  • 36.
  • 37. Diagram labels: VPN Connection; AWS Direct Connect; Corporate Data Center; AWS Cloud; Amazon S3; logs / files; Source DBs; S3 Multipart Upload; AWS Import/Export; Amazon RDS; Amazon Glacier; Amazon Kinesis; Amazon DynamoDB; Amazon Redshift; Remote Loading using SSH; Amazon Elastic MapReduce; Amazon EC2 or On-Premise
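For the S3 Multipart Upload path named on this slide, a large file is split into numbered parts that are uploaded independently and then assembled. The sketch below only plans the byte ranges and makes no AWS calls; `plan_parts` and the 8 MiB default are hypothetical choices, while the 5 MiB minimum part size (for all parts except the last) and the 10,000-part ceiling are S3's documented limits.

```python
MIN_PART = 5 * 1024 * 1024      # S3 minimum part size (every part but the last)
MAX_PARTS = 10_000              # S3 maximum number of parts per upload

def plan_parts(total_size, part_size=8 * 1024 * 1024):
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART:
        raise ValueError("part_size is below the 5 MiB minimum")
    count = -(-total_size // part_size)   # ceiling division
    if count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    # Part numbers start at 1; the final part carries the remainder.
    return [
        (n + 1, n * part_size, min(part_size, total_size - n * part_size))
        for n in range(count)
    ]

for part in plan_parts(20 * 1024 * 1024):
    print(part)
```

Each planned range could then be read from disk and handed to whichever upload client is in use; parts can be sent in parallel and retried individually.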
  • 38. Corporate data center DB Data Warehouse Extracts Amazon Redshift PostgreSQL/ODBC/JDBC Social media Amazon EMR/Spark/R/Mahout/Giraph Sqoop Hive/Shark - ODBC/JDBC AWS cloud Log files and unstructured data Hive Amazon DynamoDB RS COPY AWS Data Pipeline Amazon SWF Corporate data center Visualization and analysis (Tableau, Jaspersoft, etc.) ODBC/JDBC AWS cloud Visualization and analysis (Tableau, Jaspersoft, etc.) Presentation tools Amazon S3 Gnip Data Collector Amazon Kinesis
  • 39. Corporate data center ODBC/JDBC AWS cloud Corporate data center DB Data Warehouse Extracts Amazon Redshift PostgreSQL/ODBC/JDBC Social media Hive/Shark - ODBC/JDBC AWS cloud Log files and unstructured data AWS Data Pipeline Amazon SWF Amazon S3 Gnip Data Collector Amazon Kinesis Cloudera EDH on Amazon EC2/Spark/R/Mahout/Giraph Sqoop MongoDB on AWS Visualization and analysis (Tableau, Jaspersoft, etc.) Presentation tools
  • 40. Corporate data center DB Data Warehouse Extracts PostgreSQL/ODBC/JDBC Social media Sqoop Hive/Shark - ODBC/JDBC AWS cloud Log files and unstructured data Amazon SWF Corporate data center JSON AWS cloud Presentation tools Amazon S3 Gnip Data Collector Amazon Kinesis MongoDB on AWS Presentation tools Amazon EMR/Spark/R/Mahout/Giraph AWS Data Pipeline
  • 41. exponentially; removes Big Data constraints (three ‘V’s); Hadoop in the cloud; real-time data analytics; agile Big Data platform
  • 42.
  • 43. ICAO Headquarters ICAO Regional Office
  • 45. Data, Basic UI: Create, Read, Update, Delete. Data, Fancy UI: Read, Metrics.
  • 46.
  • 47. Collect Map Reduce Publish Key Priority
  • 48.
  • 49. Make one XML element per line for Amazon EMR: tr -d "\n" | tr -d "\r" | sed "s#<Accident>#\n<Accident>#g" Use Linux crontab to schedule. Amazon S3, Amazon EC2
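The pipeline on this slide flattens the XML by deleting line breaks, then starts a new line before every `<Accident>` opening tag so that each record lands on its own line for line-oriented EMR input. A Python equivalent, as a sketch (the function name and the sample document are made up for illustration):

```python
def one_element_per_line(xml_text, tag="Accident"):
    """Strip CR/LF, then insert a newline before each opening <tag>."""
    flat = xml_text.replace("\n", "").replace("\r", "")
    opening = "<" + tag + ">"
    return flat.replace(opening, "\n" + opening)

# Hypothetical input; the real feed's schema is not shown in the slides.
sample = (
    "<Accidents>\r\n"
    "<Accident>\n<Id>1</Id>\n</Accident>\n"
    "<Accident><Id>2</Id></Accident>\n"
    "</Accidents>\n"
)
print(one_element_per_line(sample))
```

Like the sed expression, this matches only the bare opening tag, so an `<Accident>` element carrying attributes would not be split.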
  • 50. --put /home/ec2-user/key/newtest.pem --to /home/hadoop Put the SSH key in /home/hadoop if you need a remote shell
  • 53. treat Amazon Elastic MapReduce Amazon S3
  • 54. Amazon EC2 Amazon S3 Amazon EMR
  • 55. Learn from AWS big data experts: blogs.aws.amazon.com/bigdata; BDT205: Your First Big Data Application on AWS; BDT403: Netflix’s Next Generation Big Data Platform; BDT305: Lessons Learned and Best Practices for Running Hadoop on AWS
  • 56. Please give us your feedback on this session. Complete session evaluations and earn re:Invent swag. http://bit.ly/awsevals