(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift | AWS re:Invent 2014

• scale to infinity
• Big Data constraints
• strengths or limitations
• Hadoop ecosystem
• real-time analytics
• Big Data partner solutions
• workflow automation

Building the Square Kilometer Array (SKA) - the Biggest Radio Telescope
SKA will process as much data every day as the world currently produces in a year
Using AWS and crowd-sourced CPUs to analyze 400-500 galaxies simultaneously

Mobile / Cable Telecom
Oil & Gas Industrial Manufacturing
Retail/Consumer Entertainment Hospitality
Life Sciences Scientific Exploration
Financial Services
Publishing Media Advertising
Online Media Social Network Gaming

Unstructured data growth is explosive, with an estimated compound annual growth rate (CAGR) of 62%
Source: IDC
[Figure: Data volume gap, 1990 to 2020: generated data vs. data available for analysis]
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Remove Constraints
100 instances x 1 hour

Remove Constraints
1000 instances x 1 hour
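
The arithmetic behind the two slides above can be sketched as follows. This is a minimal illustration, assuming a job that parallelizes perfectly and a flat hourly price per instance; the price used in the usage note is hypothetical, not an AWS rate.

```python
def instance_hours(instances, hours):
    """Total capacity consumed: instances multiplied by wall-clock hours."""
    return instances * hours


def hours_needed(total_work_instance_hours, instances):
    """Wall-clock time for a perfectly parallel job (an idealized assumption)."""
    return total_work_instance_hours / instances


def cost(instances, hours, price_per_instance_hour):
    """Pay-for-what-you-use cost under a flat hourly price (hypothetical)."""
    return instance_hours(instances, hours) * price_per_instance_hour
```

Under these assumptions, 1000 instance-hours of work takes 10 hours on 100 instances or 1 hour on 1000 instances, and `cost(100, 10, 0.5)` equals `cost(1000, 1, 0.5)`: the result arrives sooner for the same spend.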

Remove Constraints
No upfront capital + On-demand services + Elastic and scalable + Pay for what you use = AWS removes constraints

Big Data Constraints
• Volume: massive datasets
• Variety: requiring new tools
• Velocity: iterative, experimental data manipulation and analysis
• Time to results: more critical than absolute performance
AWS Cloud Computing
• Virtually unlimited resources
• Variety of compute solutions
• Iterative, experimental usage/deployment of infrastructure
• Get faster results with effective parallel autonomous projects

Foundation Services
• Compute (VMs, Auto-scaling and Load Balancing)
• Storage (Object, Block and Archive)
• Networking
• Security & Access Control
Infrastructure
• Regions
• Availability Zones
• CDN and Points of Presence
Platform Services
• Databases: Relational, NoSQL, Caching
• Analytics: Hadoop, Real-time, Data warehouse, Data Workflows
• App Services: Queuing, Orchestration, App streaming, Transcoding, Email, Search
• Deployment & Management: Containers, DevOps Tools, Resource Templates, Usage Tracking, Monitoring and Logs
• Mobile Services: Identity, Sync, Mobile Analytics, Notifications
Enterprise Applications
• Virtual Desktops
• Collaboration and Sharing

HDFS, YARN, MapReduce
Courtesy: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

EMR Cluster
S3
Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
Getting Started: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-overview.html
Put the data into Amazon S3
Launch the cluster using the console, CLI, SDK, or API
You can easily add and remove nodes
You can also store everything in HDFS
Get the output from Amazon S3
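
The launch step above can be sketched as a request description. This is a minimal sketch, not a definitive recipe: it only builds a dictionary shaped like the keyword arguments of boto3's EMR `run_job_flow` call and never contacts AWS. The bucket paths, job JAR location, step name, and instance types are hypothetical placeholders, and the Hadoop distribution/version choice is deliberately left out.

```python
def build_cluster_request(name, log_uri, input_uri, output_uri, node_count,
                          master_type="m3.xlarge", core_type="m3.xlarge"):
    """Describe a transient cluster that reads from S3 and writes back to S3.

    The instance types and the JAR path below are illustrative assumptions.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "Name": name,
        "LogUri": log_uri,
        "Instances": {
            "MasterInstanceType": master_type,
            "SlaveInstanceType": core_type,
            "InstanceCount": node_count,
            # Shut the cluster down once the step list is finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-input",  # hypothetical step name
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3://example-bucket/jobs/my-job.jar",  # hypothetical
                "Args": [input_uri, output_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `client.run_job_flow(**request)`, together with the version setting appropriate to the chosen release; input and output stay in Amazon S3, so the cluster itself can be discarded after the job.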

Hadoop 2.4.0
Hive, Pig, HBase, Impala, Ganglia
encryption
Consistent view on every cluster node
References:
http://aws.amazon.com/elasticmapreduce/getting-started/
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html

Basic statistics are suitable for Hadoop
Some other Big Data problems are not suitable for Hadoop: dependencies, splits that are interrelated, access to data across splits, iterative computations
Courtesy: http://www.amazon.com/Big-Data-Analytics-Beyond-Hadoop/dp/0133837947/
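
Basic statistics fit the MapReduce model because each split can be summarized on its own and the partial summaries merged afterwards. The following is a minimal single-process sketch of that idea, assuming numeric records; the function names are illustrative and nothing here uses Hadoop itself.

```python
from functools import reduce


def map_split(split):
    """Map: summarize one split independently as (count, sum, sum of squares)."""
    return (len(split), sum(split), sum(x * x for x in split))


def combine(a, b):
    """Reduce: merge two partial summaries; the merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(splits):
    """Population mean and variance computed from per-split summaries."""
    n, total, total_sq = reduce(
        combine, (map_split(s) for s in splits), (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("no records")
    mean = total / n
    return mean, total_sq / n - mean * mean
```

No split needs to see another split's records, which is what makes this a good fit. An iterative algorithm, by contrast, would need a full map and reduce pass per iteration.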

• Accumulo – cell-based access control NoSQL
• Avro – data serialization system
• Cascading – alternative language APIs on MR
• Cassandra – multi-master NoSQL DB
• Chukwa – data collection system at scale
• Flume – collecting, aggregating, moving logs
• Giraph – iterative graph processing system
• HBase – large table NoSQL DB
• HDFS – distributed file system
• Hive – SQL on MapReduce, data warehouse
• Mahout – scalable machine learning library
• MapReduce – parallel processing on YARN
• Nutch – web crawler software
• Pig – high-level scripting on MapReduce
• R – statistical computing and graphics
• Spark – general compute engine on YARN
• Sqoop – transferring data to/from RDBMS
• Tez – data-flow programming on YARN
• Thrift – build scalable cross-language services
• ZooKeeper – high-performance coordination
Courtesy: http://www.apache.org/

• scripting
• statistical analysis
• mixture of paradigms
• single-machine, single-thread
Hadoop offers a path to scale R computation to distributed systems
Courtesy: http://www.r-project.org/
http://www.amazon.com/Learning-R-Richard-Cotton/dp/1449357105/

R on every node
Hadoop Streaming
Revolution Analytics RHadoop
• rmr – mapreduce()
• rhdfs
• rhbase
RStudio
References:
http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UseCase_Streaming.html
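
Hadoop Streaming runs any executable as a mapper or reducer: records arrive on standard input, and tab-separated key/value lines are written to standard output. That contract is what lets a script in R (or any other language present on every node) take part in a job. The sketch below illustrates the contract in Python rather than R, with a local stand-in for the sort phase; the word-count logic and function names are illustrative assumptions.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit one 'key<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the values per key; input must already be sorted by key,
    which is how Hadoop delivers mapper output to a reducer."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_locally(text):
    """Simulate map, shuffle/sort, and reduce in a single process."""
    mapped = sorted(mapper(io.StringIO(text)))
    return list(reducer(mapped))
```

In a real streaming job the mapper and reducer would be separate scripts reading `sys.stdin`; `run_locally` exists only so the logic can be checked without a cluster.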

scalable machine learning library
Collaborative filtering (recommender engines), e.g. for movies, books, etc., based on comparing user preferences
Clustering (unsupervised learning), e.g. identifying groupings of related news stories based on input data properties
Classification (supervised learning or predictive analytics), e.g. spam filtering based on training spam data
Courtesy: http://mahout.apache.org
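
To make "comparing user preferences" concrete, here is a minimal user-based collaborative filtering sketch. It is not Mahout's implementation or API: it assumes ratings held in plain dictionaries, uses cosine similarity between users, and all names are illustrative.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two users' {item: rating} dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], prefs)
        if similarity <= 0:
            continue  # users with no overlap contribute nothing
        for item, rating in prefs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

A library such as Mahout addresses the same problem at a scale where the user-by-item matrix no longer fits on one machine, which this sketch does not attempt.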

Mahout
References:
http://mahout.apache.org/users/classification/twenty-newsgroups.html
http://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html

Developed by Yahoo! based on Google Pregel (PageRank)
Customized by Facebook to scale on the full friendship graph (~1B vertices and ~ 100B edges)
Single-vertex-centric API
Bulk Synchronous Parallel machine
ZooKeeper-enforced atomic barrier
Iterations performed in memory
Runs in mappers, or native YARN
Courtesy: http://giraph.apache.org
http://giraph.apache.org/pagerank.html
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
Single Source Shortest Path Example
values sent as messages (blue)
BSP superstep
vertex updates (red)
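
The single-source shortest path example can be sketched as a vertex-centric, bulk synchronous parallel loop. This is a single-process simulation under stated assumptions, not Giraph code: the graph is an adjacency dictionary with non-negative weights, and the function name and data layout are illustrative.

```python
import math


def sssp(edges, source):
    """Single-source shortest paths in BSP style.

    edges maps each vertex to a list of (neighbor, weight) pairs.
    Returns (distances, number of supersteps executed).
    """
    vertices = set(edges) | {n for outs in edges.values() for n, _ in outs}
    vertices.add(source)
    dist = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}  # messages waiting for the next superstep
    supersteps = 0
    while inbox:  # halt when no messages remain in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best  # vertex update
                for neighbor, weight in edges.get(vertex, []):
                    # value sent as a message to the neighbor
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox  # barrier: messages become visible only next superstep
        supersteps += 1
    return dist, supersteps
```

Each pass of the `while` loop is one superstep: vertices that received messages update their value and send new candidate distances, and the swap of `inbox` and `outbox` stands in for the barrier between supersteps.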

configure-hadoop
Apache ZooKeeper
Giraph: http://git-wip-us.apache.org/repos/asf/giraph.git
Maven 3, JAR file
Giraph jar: http://giraph.apache.org/apidocs/org/apache/giraph/examples/SimplePageRankComputation.html
References:
http://giraph.apache.org/quick_start.html
http://giraph.apache.org/build.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html

Getting Started: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
Leader node (SQL clients, BI tools access)
• PostgreSQL endpoint
• Stores metadata
• Coordinates queries
Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Amazon S3 load, backup/restore
• Integration with Amazon DynamoDB, EMR, Kinesis
[Diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node (128GB RAM, 16TB disk, 16 cores); the leader node connects over 10 GigE (HPC) to three compute nodes (each 128GB RAM, 16TB disk, 16 cores); ingestion, backup, and restore go through Amazon S3]

JDBC/ODBC
Connect using drivers from PostgreSQL.org
Amazon Redshift

• 2 years old
• all over the world
• 1900 products
• 200 allow BYOL
• BI tools

MongoDB on AWS Architecture Whitepaper
Running MongoDB on Amazon EC2
Can easily launch a multi-node replica set
Keep JSON templates in source control
https://mongodb-documentation.readthedocs.org/en/latest/ecosystem/tutorial/automate-deployment-with-cloudformation.html
AWS CloudFormation JSON Templates
AMI in AWS Marketplace
No extra cost

Running Cloudera EDH on Amazon EC2
Cloudera on AWS Product Brief
Cloudera Enterprise Data Hub on AWS
Deploy via Cloudera Director
Manage via Cloudera Manager
http://aws.amazon.com/about-aws/whats-new/2014/10/15/clouderas-enterprise-data-hub-edh-on-aws-quick-start/
AWS CloudFormation JSON Templates
Cloudera Enterprise Reference Architecture on AWS

VPN Connection
AWS Direct Connect
Corporate Data center
AWS Cloud
Amazon S3
logs / files
Source DBs
S3 Multipart Upload
AWS Import/Export
Amazon RDS
Amazon Glacier
Amazon Kinesis
Amazon DynamoDB
Amazon Redshift
Remote Loading using SSH
Amazon Elastic MapReduce
Amazon EC2 or On-Premise

Corporate data center
DB
Data Warehouse Extracts
Amazon Redshift
PostgreSQL/ODBC/JDBC
Social media
Amazon EMR/Spark/R/Mahout/Giraph
Sqoop
Hive/Shark - ODBC/JDBC
AWS cloud
Log files and unstructured data
Hive
Amazon DynamoDB
RS COPY
AWS Data Pipeline
Amazon SWF
Visualization and analysis (Tableau, Jaspersoft, etc.)
ODBC/JDBC
Presentation tools
Amazon S3
Gnip Data Collector
Amazon Kinesis

ODBC/JDBC
AWS cloud
DB
Amazon Redshift
Social media
Log files and unstructured data
AWS Data Pipeline
Amazon SWF
Amazon S3
Gnip Data Collector
Amazon Kinesis
Cloudera EDH on Amazon EC2/Spark/R/Mahout/Giraph
Sqoop
MongoDB on AWS
Visualization and analysis (Tableau, Jaspersoft, etc.)
Presentation tools

DB
Social media
Sqoop
AWS cloud
Log files and unstructured data
Amazon SWF
JSON
Presentation tools
Amazon S3
Gnip Data Collector
Amazon Kinesis
MongoDB on AWS
Amazon EMR/Spark/R/Mahout/Giraph
AWS Data Pipeline

• exponentially
• removes Big Data constraints (three 'V's)
• Hadoop in the cloud
• real-time data analytics
• agile Big Data platform

ICAO Headquarters
ICAO Regional Office

cloud, cloud, in-house, in-house, synced, cloud

Data
Basic UI
Create
Read
Update
Delete
Data
Fancy UI
Read
Metrics

Collect
Map
Reduce
Publish
Key
Priority

tr -d "\n" | tr -d "\r" |
sed "s#<Accident>#\n<Accident>#g"
Amazon S3
Amazon EC2
Use Linux crontab to schedule
Make one XML element per line for Amazon EMR
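
The shell pipeline on this slide strips all line breaks and then starts a new line before each `<Accident>` element, so that every line-oriented Hadoop record holds one complete element. A minimal Python sketch of the same transformation follows, assuming the whole document fits in memory; the function name and the `tag` parameter are illustrative.

```python
def one_element_per_line(xml_text, tag="Accident"):
    """Flatten the XML, then break the line before every opening <tag>."""
    # Equivalent of: tr -d "\n" | tr -d "\r"
    flat = xml_text.replace("\n", "").replace("\r", "")
    # Equivalent of: sed "s#<Accident>#\n<Accident>#g"
    return flat.replace(f"<{tag}>", f"\n<{tag}>")
```

As with the shell version, the first output line holds whatever precedes the first element (for example the root tag), and the closing root tag stays attached to the last element, so a mapper should tolerate both.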

--put /home/ec2-user/key/newtest.pem
--to /home/hadoop
Copy the SSH key to /home/hadoop if you need a remote shell

s3://elasticmapreduce/libs/script-runner/script-runner.jar
Move the results from Amazon S3 to somewhere else

Amazon Elastic MapReduce
Amazon S3

treat
Amazon Elastic MapReduce
Amazon S3

Amazon EC2 Amazon S3 Amazon EMR

Learn from AWS big data experts
blogs.aws.amazon.com/bigdata
BDT205: Your First Big Data Application on AWS
BDT403: Netflix’s Next Generation Big Data Platform
BDT305: Lessons Learned and Best Practices for Running Hadoop on AWS

Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals
