Big Data and the Cloud
Shekhar Vemuri
#phxdataconference
ABOUT
• PRINCIPAL at CLAIRVOYANT
• PRODUCT, DATA, ANALYTICS and CLOUD
• large scale web and data systems
• simple, lightweight solutions
QUICK POLL
• HADOOP, HIVE, PIG
• PUBLIC CLOUD, IaaS, SaaS
• AMAZON AWS, EC2
• ELASTICITY
• S3, EMR, KINESIS
• IoT
WHAT WILL WE TALK ABOUT
BIG
DATA
USE CASES
RISK MODELING
PERSONALIZED
MEDICINE
AD TARGETING
INTERNET OF
THINGS
THREAT
ANALYSIS
RECOMMENDATIONS
SURVEILLANCE RETENTION
360 CUSTOMER
VIEW
DRIVING FACTORS
• variety in data
• not just transactional data
• potential for tremendous insight - when combining
transactional data with additional data sources
• LinkedIn, Twitter, Facebook, Pinterest, Open Data
• Internet of Things
the
CLOUD
the CLOUD
• IaaS, SaaS
• on demand subscription
• subscription vs owning
• tradeoff
• ease of adoption
• powering next-gen entrepreneurship
LANDSCAPE
DATA VALUE
CHAIN
DATA VALUE CHAIN
GENERATE > STORE > ANALYZE > INSIGHTS
ingest, transform, transform
BIG DATA + the
CLOUD
LOG
ANALYSIS
AMAZON S3
AMAZON EC2
LOG FILES
ReST CLIENTS
WEB APP, REST APIs
AMAZON EMR
AMAZON S3
AMAZON EC2
LOG FILES
ReST CLIENTS
WEB APP, REST APIs
AMAZON REDSHIFT
LOG FILES - STORED in S3
MAP-REDUCE, HIVE,
PIG, CASCADING jobs
STORE summarized data
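The flow on this slide (log files stored in S3, a map-reduce job over them, summarized data stored for querying) can be sketched in plain Python. This is a minimal illustration, not the speaker's pipeline: the log format, the sample lines, and the function names `map_line`, `reduce_counts`, and `summarize` are all assumptions, and the in-memory list stands in for log files in S3.

```python
from collections import defaultdict

# Hypothetical access-log lines; in the architecture above these would be
# log files stored in S3 and read by a job running on EMR.
SAMPLE_LOGS = [
    "GET /api/users 200",
    "GET /api/users 200",
    "POST /api/orders 500",
    "GET /api/orders 200",
    "malformed line",
]


def map_line(line):
    """Map step: emit ((path, status), 1) for each well-formed line."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].isdigit():
        return []  # skip lines that do not match the assumed format
    _method, path, status = parts
    return [((path, int(status)), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def summarize(lines):
    """Run map, then reduce; the result is the summarized data to store."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return reduce_counts(pairs)


if __name__ == "__main__":
    for (path, status), hits in sorted(summarize(SAMPLE_LOGS).items()):
        print(path, status, hits)
```

The summary is far smaller than the raw logs, which is why it suits a warehouse-style store for querying while the raw files stay in S3.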
AMAZON EMR
AMAZON S3
AMAZON EC2
LOG FILES
ReST CLIENTS
WEB APP, REST APIs
LOG FILES - STORED in S3
MAP-REDUCE, HIVE,
PIG, CASCADING jobs
CLOUDERA IMPALA
AMAZON S3
AMAZON KINESIS
AMAZON REDSHIFT, AMAZON DYNAMODB, AMAZON RDS
AMAZON EMR
AMAZON S3
DATA
AMAZON S3
INPUT
AMAZON EMR
AMAZON S3
INPUT
OUTPUT
AMAZON EMR
AMAZON S3
INPUT
OUTPUT
AMAZON EMR
AMAZON EMR
WITH SPOT instances
BUILDING BLOCKS
• amazon AWS
• amazon EMR
• amazon S3
• kinesis
• redshift
• spot instances
PROS
• like other cloud solutions - reduces the barrier to
adoption
• especially if you are already in the cloud
• can provide ability to implement quick POCs
CONS
• depending on your current infrastructure - may end up
continually replicating data
• data security, privacy
LEARNINGS
• Build platforms once the need is strongly felt
• Prepare to fail fast, a couple of times, before the final
version
• what you think will happen, will not
LEARNINGS
• COSTS can spiral out of control
• Leverage spot instances to reduce costs, especially for
bursty workloads
• S3 can be very slow to run and initialize large workloads
• especially in recovery scenarios
• but data resiliency is not an issue
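The cost point above can be made concrete with a little arithmetic. The sketch below compares an all on-demand cluster with a small on-demand core plus spot nodes for the burst. Every number in it is hypothetical (the hourly rates, node counts, and run length are not real AWS prices and are not from the talk), and the function names are illustrative only. Rates are in whole cents to keep the arithmetic exact.

```python
def cluster_cost(node_count, hours, hourly_rate_cents):
    """Cost in cents of running node_count instances for the given hours."""
    return node_count * hours * hourly_rate_cents


def mixed_cluster_cost(core_nodes, spot_nodes, hours,
                       on_demand_rate_cents, spot_rate_cents):
    """Cost in cents of an on-demand core plus spot nodes for the burst."""
    core = cluster_cost(core_nodes, hours, on_demand_rate_cents)
    burst = cluster_cost(spot_nodes, hours, spot_rate_cents)
    return core + burst


if __name__ == "__main__":
    # Hypothetical rates in cents per instance-hour, for illustration only.
    ON_DEMAND_RATE = 40
    SPOT_RATE = 10

    all_on_demand = cluster_cost(10, 6, ON_DEMAND_RATE)
    mixed = mixed_cluster_cost(2, 8, 6, ON_DEMAND_RATE, SPOT_RATE)

    print("all on-demand (cents):", all_on_demand)
    print("core + spot (cents):", mixed)
```

Spot capacity can be reclaimed at short notice, so this split only fits work that tolerates losing nodes mid-run, which is one reason to keep the core nodes on demand.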
www.clairvoyantsoft.com
@shekharv
linkedin.com/in/shekharvemuri

Editor's Notes

  • #7 things that have been done for a while, but now in a more closed loop with faster, tighter iterations
  • #8 sentiment analysis
  • #10 OPEX vs CAPEX. Moving away from large upfront investments and costs. Tradeoff: personnel cost, flexibility, demand feasibility, etc. Any startup, and anyone with an idea, can try: lean startups, MVPs, prove before you expand.
  • #12 EDH EDW
  • #13 GENERATE -> INGEST - flume, fluentd, kafka, kinesis
  • #14 EDH EDW
  • #15 EDH EDW
  • #16 EDH EDW
  • #17 EDH EDW
  • #18 EDH EDW
  • #19 BOOTSTRAP actions
  • #20 BOOTSTRAP actions
  • #21 BOOTSTRAP actions
  • #22 BOOTSTRAP actions
  • #23 BOOTSTRAP actions
  • #24 BOOTSTRAP actions
  • #25 m3 and c3 xlarge
  • #27 sentiment analysis
  • #30 If you are an enterprise adopting big data, you need laser focus on proving the business value. Focus on the big problems first. Iterate rapidly. Spinning things up and down is cheap.