This presentation explains why Cloud Computing is Big Data's best friend and how AWS Cloud components fit together to complete your Big Data Life Cycle.
Agenda:
- How big is Big Data, and how fast is it growing?
- How Cloud has the potential to become Big Data's Best Friend
- A tour of the Big Data Life Cycle
- How AWS Cloud Components fit into this Life Cycle
- A case study of our log analytics tool, Cloudlytics, a Big Data implementation on AWS Cloud.
2. Agenda
1. Big Data is getting Bigger and Bigger!
2. Why is Cloud Big Data’s Best Friend?
3. Figuring Out the Big Data Life Cycle
4. How AWS Building Blocks can Help Tame Big Data!
5. How Cloudlytics Uses AWS Cloud for its Big Data
Cloud IT Better
3. So What is Big Data?
Simply put, Big Data is data that cannot be processed by current tools or technologies. Big Data is too Big, too Fast and too Varied.
High Resolution images from NASA, our place in the cosmos!
4. The 3 V’s that make Big Data difficult to Tame!
Volume
Conventional databases process data in batches; it could take days or weeks to process one batch of Big Data.
2.5 quintillion bytes of data are generated every day!
Variety
Data comes from social networks, sensors installed at store entrances, traffic lights, airplanes, car GPS and countless other sources!
Velocity
Twitter generates 5 gigabytes of data per minute.
Facebook generates 7 gigabytes of data per minute.
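To put the per-minute rates on the same footing as the per-day figure, the arithmetic can be sketched as below. This is a minimal sketch that takes the slide's numbers at face value and assumes decimal (SI) units and the short-scale meaning of quintillion (10**18); the function name is illustrative.

```python
# Decimal (SI) units are an assumption: 1 GB = 10**9 bytes, 1 TB = 10**12, 1 EB = 10**18.
GB, TB, EB = 10**9, 10**12, 10**18
MINUTES_PER_DAY = 24 * 60


def per_minute_to_per_day(bytes_per_minute):
    """Scale a per-minute data rate to a per-day volume."""
    return bytes_per_minute * MINUTES_PER_DAY


# Rates quoted on the slide.
twitter_per_day = per_minute_to_per_day(5 * GB)
facebook_per_day = per_minute_to_per_day(7 * GB)

# 2.5 quintillion bytes, kept as an integer (short scale: 1 quintillion = 10**18).
daily_total = 25 * 10**17

print(f"Twitter:  {twitter_per_day / TB} TB per day")
print(f"Facebook: {facebook_per_day / TB} TB per day")
print(f"Overall:  {daily_total / EB} EB per day")
```

Under these assumptions, 5 GB per minute works out to 7.2 TB per day and 7 GB per minute to 10.08 TB per day, while 2.5 quintillion bytes is 2.5 exabytes.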
5. Big Data is Getting Bigger and BIGGER!
“It is estimated that Walmart collects more than 2.5 petabytes of data EVERY HOUR from its customer transactions.”
“More data crosses the internet EVERY SECOND than was stored in the entire internet just 20 years ago.”
“Zuckerberg noted that 1 billion pieces of content are shared via Facebook’s Open Graph DAILY!”
6. Why is Cloud Big Data’s Best Friend?
With Big Data, we know we want to Generate, Store, Analyze & Share. But how does Cloud come into the picture?
7. Our IT Resources are Limited & Precious!
And Cloud has the Solution for this!
8. Cloud Has Many Advantages
Elasticity
Fast Time to Market
On Demand
Flexible
Cost Effective
Pay Per Use
Secure
Resilient
No CapEx
Remote Access
Scalable
Pooled Resources
9. Cloud Optimizes Your IT Resources
Cloud makes sure that your precious IT resources are OPTIMIZED.
10. Cloud makes it Easy!
Cloud Makes Big Data Easier to Handle
Image Courtesy: http://www.slideshare.net/AmazonWebServicesLATAM/big-data-on-aws?
11. Let us Figure out the Big Data Life Cycle
In order to make the entire process of Big Data more tangible, it is divided into 4 stages:
- Generation
- Collection & Store
- Analyze & Computation
- Data Collaboration & Sharing
12. Generating the Data
Structured Data – Employee Records
Semi-Structured Data – End User Logs
Unstructured Data – Social User Profile images
Financial analysis
Scientific simulations
Bioinformatics research
Data warehousing
Web indexing
Log file analysis
Data Mining
Machine learning
Web-based APIs can be used to access this data and store it.
13. Transferring Your Data to AWS Cloud
To transfer your data sets to the Cloud, you can use:
• AWS Import/Export: Move large amounts of data into and out of AWS using portable storage devices for transport.
• AWS Storage Gateway: Secure integration between an on-premises IT environment & AWS’s storage infrastructure.
• AWS Direct Connect: Establish a dedicated network connection from your premises to AWS.
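Choosing between shipping storage devices and sending data over a network link largely comes down to how long the network transfer would take. The estimate can be sketched as below; the 80% link utilization, the example data set size and the link speeds are assumptions for illustration, not AWS figures.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(dataset_bytes, link_bits_per_second, utilization=0.8):
    """Estimate the days needed to push a data set through a network link.

    `utilization` is an assumed fraction of the nominal bandwidth that is
    actually available for the transfer.
    """
    effective_bits_per_second = link_bits_per_second * utilization
    seconds = dataset_bytes * 8 / effective_bits_per_second
    return seconds / SECONDS_PER_DAY


# Hypothetical example: a 100 TB data set over two link speeds.
dataset = 100 * 10**12
for label, speed in [("100 Mbps", 100 * 10**6), ("1 Gbps", 10**9)]:
    print(f"{label}: {transfer_days(dataset, speed):.1f} days")
```

When the estimate runs into weeks or months, moving the data on portable storage devices is worth considering; when it is hours or days, a network transfer is usually simpler.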
14. Collecting & Storing Data on AWS Cloud
Simple Storage Service (S3)
Write, read, and delete objects containing from 1 byte to 5 terabytes of data each.
AWS Relational Database Service (RDS)
A full-featured relational database giving you access to the capabilities of a MySQL, Oracle, SQL Server, or PostgreSQL database engine.
AWS DynamoDB
A fast, fully managed NoSQL database service making it simple & cost-effective to store & retrieve any amount of data, and serve any level of request traffic.
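The write, read and delete operations on S3 objects can be sketched as follows. The helper takes any client exposing `put_object`, `get_object` and `delete_object` with the keyword arguments used by the boto3 S3 client; the `InMemoryS3` class is a hypothetical stand-in so the sketch runs without AWS credentials, and the bucket and key names are made up.

```python
import io


class InMemoryS3:
    """Hypothetical stand-in that mimics the three boto3 S3 client calls used below."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = bytes(Body)
        return {}

    def get_object(self, Bucket, Key):
        # boto3 returns a streaming body with a read() method; BytesIO imitates it.
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

    def delete_object(self, Bucket, Key):
        self._objects.pop((Bucket, Key), None)
        return {}


def round_trip(s3, bucket, key, payload):
    """Write an object, read it back, then delete it."""
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.delete_object(Bucket=bucket, Key=key)
    return data


# With real AWS, pass boto3.client("s3") (third-party package) instead of the stand-in.
print(round_trip(InMemoryS3(), "example-bucket", "logs/sample.txt", b"hello"))
```

Objects near the upper end of the size range cannot be sent in a single PUT request; they need S3's multipart upload, which this sketch does not cover.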
15. Data Analysis on AWS Cloud
Once you’ve stored your content on the Cloud, it is time to analyze it!
So if you’re thinking of implementing a Hadoop infrastructure ……
http://dorkutopia.com/wp-content/uploads/2013/06
16. Data Analysis on AWS Cloud
Setting up a Hadoop infrastructure is not that easy, but AWS has the answer!
Image courtesy: http://globalgeeknews.com/wp-content/uploads/
17. Data Analysis on AWS Cloud
Amazon Elastic MapReduce (EMR)
• A managed Hadoop distribution by Amazon Web Services, using a customized Apache Hadoop framework.
• Uses MapReduce, in which data processing tasks are mapped to a set of servers in a cluster for processing.
• EMR integrates with AWS S3 (an alternative storage to HDFS) & EC2 (compute instances).
• EMR allows you to tune the default Hadoop job flows to your custom needs.
• The various how-tos of Hadoop architecture, such as adding, removing & configuring nodes, are taken care of by EMR.
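The MapReduce model mentioned above can be illustrated with a small word count. This is a single-process sketch of the map, shuffle and reduce phases in plain Python, not EMR or Hadoop code; on a real cluster the map and reduce calls would run on different servers, and all function names here are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data on the cloud", "Big Data life cycle"]))
```

Because each map call only sees one line and each reduce call only sees one key, both phases can be spread across as many servers as the cluster provides.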
18. AWS Redshift for Retrieval & Collaboration
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service making it simple & cost-effective to efficiently analyze all your data using your existing business intelligence tools.
• Amazon Redshift has a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations.
• You can use AWS Redshift to store and retrieve processed data quickly, to generate custom reports.
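The idea behind parallelizing and distributing a SQL operation can be sketched with a grouped sum, roughly `SELECT group_key, SUM(value) ... GROUP BY group_key`. This is an illustration of the general MPP pattern (distribute rows, aggregate per slice, merge the partial results), not a description of Redshift internals; the slices run one after another here, and all names and sample rows are made up.

```python
import zlib
from collections import defaultdict


def distribute(rows, num_slices, key):
    """Assign each row to a slice by hashing its distribution key."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode()) % num_slices
        slices[index].append(row)
    return slices


def partial_sum(slice_rows, group_key, value_key):
    """Work done independently on one slice: a local grouped sum."""
    totals = defaultdict(int)
    for row in slice_rows:
        totals[row[group_key]] += row[value_key]
    return totals


def merge(partials):
    """Combine the per-slice results into the final answer."""
    totals = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)


def parallel_group_sum(rows, group_key, value_key, num_slices=4):
    slices = distribute(rows, num_slices, key=group_key)
    return merge(partial_sum(s, group_key, value_key) for s in slices)


rows = [
    {"region": "eu", "bytes": 10},
    {"region": "us", "bytes": 5},
    {"region": "eu", "bytes": 1},
]
print(parallel_group_sum(rows, "region", "bytes"))
```

The merge step touches only one small partial result per slice, which is why adding slices lets the expensive scan be shared out without changing the answer.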
19. AWS Data Pipelines for Automation
AWS Data Pipeline allows users to define a dependent chain of data sources and destinations, with an option to create data processing activities; this chain is called a pipeline.
Input Node -> Activity -> Output Node
• Can be implemented across all stages of the Big Data Life Cycle.
• Tasks are scheduled to perform data movement and processing activities.
• Failure & retry options are also available in Data Pipeline workflows.
• Input & output data nodes support S3 Bucket, DynamoDB, MySQL DB & SQL Data Source.
• Activities currently supported are Copy, EMR, Hive & Shell Activity.
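The input node, activity, output node chain with retry on failure can be sketched in plain Python as below. This is not the AWS Data Pipeline API or its definition format; it only illustrates the dependency chain and the retry behaviour, and every name and default value is an assumption made for the example.

```python
import time


def run_with_retry(activity, data, max_attempts=3, delay_seconds=0.0):
    """Run one activity, retrying on failure up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return activity(data)
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_error = err
            time.sleep(delay_seconds)
    raise RuntimeError(f"activity failed after {max_attempts} attempts") from last_error


def run_pipeline(read_input, activities, write_output, max_attempts=3):
    """Input node -> activities (in dependency order) -> output node."""
    data = read_input()
    for activity in activities:
        data = run_with_retry(activity, data, max_attempts)
    write_output(data)
    return data


# Hypothetical usage: one activity that upper-cases the rows it receives.
sink = []
run_pipeline(lambda: ["a", "b"], [lambda rows: [r.upper() for r in rows]], sink.extend)
print(sink)
```

Each activity only starts once the previous one has returned, so a failure that exhausts its retries stops the chain before anything reaches the output node.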
20. AWS Kinesis (NEW)
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of TBs of data per hour from hundreds of thousands of sources.
• Real-time processing allows you to answer questions about the current state of your data.
• Amazon Kinesis automatically provisions & manages the storage required to reliably & durably collect your data stream.
• You can add as many Kinesis streams as desired, based on the volume & variety of data.
• Your Kinesis streams are connected to your Kinesis app, from which you can use DynamoDB or Redshift to process complex queries in real time.
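Answering questions about the current state of a stream usually means aggregating over a recent time window rather than over all data ever received. A minimal sketch of that idea follows; it is plain Python rather than the Kinesis API, it assumes events arrive in non-decreasing timestamp order, and the class name and sample events are made up.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events per key over the most recent `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, key) pairs, oldest first

    def add(self, timestamp, key):
        self._events.append((timestamp, key))

    def counts(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        result = {}
        for _, key in self._events:
            result[key] = result.get(key, 0) + 1
        return result


counter = SlidingWindowCounter(window_seconds=60)
counter.add(0, "login")
counter.add(30, "purchase")
counter.add(50, "login")
print(counter.counts(now=70))
```

In a real streaming application the `add` calls would be driven by records read from the stream, and the windowed counts could be written to a store such as DynamoDB or Redshift for querying.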
Image courtesy: https://static.gosquared.com/images/liquidicity/kinesis/
21. The Big Data Life cycle - Compiled
Generation
Collection & Store: AWS S3, AWS RDS, AWS DynamoDB
Analyze & Computation: AWS EMR, AWS Data Pipeline
Data Collaboration & Sharing: AWS Redshift, AWS Data Pipeline
Component: Description
AWS S3: …
AWS RDS: …
AWS DynamoDB: …
AWS Data Pipeline: …
22. Use Case - Cloudlytics
Cloudlytics is a pay-as-you-go, SaaS-based log analytics tool powered by AWS. It takes the Big Data approach using AWS components such as EMR & Redshift.
Customer Log Files Stored in S3 -> Processing -> Processed Data -> Customer Reports
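The flow from raw log files to a customer report can be sketched end to end on a tiny scale. This is not how Cloudlytics is implemented: the log line format, the field names and the report contents are hypothetical, and everything runs in one process instead of on EMR and Redshift.

```python
from collections import Counter


def parse_line(line):
    """Processing step: turn one raw log line into a structured record.

    Hypothetical format: "<client_ip> <path> <status> <bytes_sent>".
    """
    ip, path, status, size = line.split()
    return {"ip": ip, "path": path, "status": int(status), "bytes": int(size)}


def build_report(lines, top_n=3):
    """Reporting step: summarize the processed records."""
    records = [parse_line(line) for line in lines if line.strip()]
    hits = Counter(record["path"] for record in records)
    return {
        "requests": len(records),
        "bytes_sent": sum(record["bytes"] for record in records),
        "errors": sum(1 for record in records if record["status"] >= 400),
        "top_paths": hits.most_common(top_n),
    }


# Hypothetical log lines standing in for files read from S3.
log_lines = [
    "10.0.0.1 /index.html 200 100",
    "10.0.0.2 /index.html 404 0",
    "10.0.0.1 /about.html 200 50",
]
print(build_report(log_lines))
```

At Big Data scale the parsing step is the part that would be spread across a cluster, with the summarized records loaded into a warehouse for report queries.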