Introduction to AWS Big Data
Omid Vahdaty, Big Data Ninja
When the data outgrows your ability to process
● Volume
● Velocity
● Variety
EMR
● Basic - the simplest of all Hadoop distributions.
● Fewer features than Cloudera's distribution.
Collect
● Firehose
● Snowball
● SQS
● Ec2
Store
● S3
● Kinesis
● RDS
● DynamoDB
● CloudSearch
● IoT
Process + Analyze
● Lambda
● EMR
● Redshift
● Machine learning
● Elasticsearch
● Data Pipeline
● Athena
Visualize
● QuickSight
● Elasticsearch Service
History of tools
HDFS - inspired by Google (the GFS paper)
Cassandra - from Facebook; columnar NoSQL store, materialized views, secondary
indexes
Kafka - from LinkedIn.
HBase - NoSQL, part of the Hadoop ecosystem.
EMR Hadoop ecosystem
● Spark - the recommended option; can do everything below…
● Hive
● Oozie
● Mahout - machine learning library (MLlib is better; runs on Spark)
● Presto - better than Hive? More generic than Hive.
● Pig - scripting for big data.
● Impala - from Cloudera, and part of EMR.
ETL Tools
● Attunity
● Splunk
● Semarchy
● Informatica
● Tibco
● Clarit
Architecture
● Decouple :
○ Store
○ Process
○ Store
○ Process
○ insight...
● Rule of thumb: 3 technologies in a data center, 7 technologies max in the cloud
○ Don't use more, because of maintenance
Architecture considerations
● Unstructured? Structured? Semi-structured?
● Latency?
● Throughput?
● Concurrency?
● Access patterns?
● PaaS? Max 7 technologies
● IaaS? Max 4 technologies
EcoSystem
● Redshift = analytics
● Aurora = OLTP
● DynamoDB = NoSQL, like MongoDB.
● Lambda
● SQS
● CloudSearch
● Elasticsearch
● Data Pipeline
● Beanstalk - “I have a JAR, install it for me…”
● AWS machine learning
Data ingestion architecture challenges
● Durability
● HA
● Ingestion types:
○ Batch
○ Stream
● Transactions: OLTP, NoSQL.
Ingestion options
● Kinesis
● Flume
● Kafka
● S3DistCp - copy from
○ S3 to HDFS
○ S3 to S3
○ Cross account
○ Supports compression.
Transfer
● VPN
● Direct Connect
● S3 multipart upload
● Snowball
● IoT
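For S3 multipart upload specifically, part sizing is constrained: S3 documents a maximum of 10,000 parts and a 5 MB minimum part size (except the last part). A small helper can pick a valid part size for any file (a sketch; only the two documented limits are used):

```python
import math

# S3 multipart upload limits (documented): at most 10,000 parts,
# 5 MB minimum part size (except the final part).
MAX_PARTS = 10_000
MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MB in bytes

def choose_part_size(total_bytes: int) -> int:
    """Smallest valid part size that keeps the upload within 10,000 parts."""
    return max(MIN_PART_SIZE, math.ceil(total_bytes / MAX_PARTS))
```

Small files get the 5 MB minimum; very large files force bigger parts so the part count stays legal.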
Streaming
● Streaming
● Batch
● Collect
○ Kinesis
○ DynamoDB Streams
○ SQS (pull)
○ SNS (push)
○ Kafka (most recommended)
Comparison
Streaming
● Low latency
● Message delivery
● Lambda architecture implementation
● State management
● Time or count based windowing support
● Fault tolerant
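The count-based windowing mentioned above can be sketched in plain Python with a toy tumbling window - an illustration of the idea only, not any specific stream processor's API:

```python
from typing import Iterable, Iterator, List

def tumbling_count_windows(events: Iterable, size: int) -> Iterator[List]:
    """Group a stream of events into non-overlapping windows of `size` events."""
    window: List = []
    for event in events:
        window.append(event)
        if len(window) == size:
            yield window       # window is full - emit it downstream
            window = []
    if window:                 # flush the partial final window
        yield window
```

Each emitted window can then be aggregated independently (a per-window count, sum, etc.); a time-based window works the same way with timestamps instead of counts.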
Stream processor comparison
Stream collection options
● Kinesis Client Library
● AWS Lambda
● EMR
● Third party
Spark Streaming (minimum latency = 1 sec) - near real time, with lots of libraries.
Storm - most real time (sub-millisecond), Java code based.
Flink - similar to Spark.
Kinesis
● Stream - collect @ source and near real time processing
○ Near real time
○ High throughput
○ Low cost
○ Easy administration - set the desired level of capacity
○ Delivery to: S3, Redshift, DynamoDB
○ Ingress 1 MB/s, egress 2 MB/s, up to 1,000 records per second (per shard).
● Analytics - in-flight analytics.
● Firehose - park your data @ the destination.
Kinesis analytics example
CREATE STREAM "INTERMEDIATE_STREAM" (
hostname VARCHAR(1024),
logname VARCHAR(1024),
username VARCHAR(1024),
requesttime VARCHAR(1024),
request VARCHAR(1024),
status VARCHAR(32),
responsesize VARCHAR(32)
);
-- Data Pump: take incoming data from SOURCE_SQL_STREAM_001 and insert it
-- into INTERMEDIATE_STREAM. A pump of this shape completes the example
-- (column names on the source stream are assumed to match the definition above):
CREATE OR REPLACE PUMP "INTERMEDIATE_PUMP" AS
INSERT INTO "INTERMEDIATE_STREAM"
SELECT STREAM "hostname", "logname", "username", "requesttime",
"request", "status", "responsesize"
FROM "SOURCE_SQL_STREAM_001";
KCL
● Read from the stream using the Get APIs
● Build applications with the KCL
● Leverage the Kinesis spout for Storm
● Leverage the EMR connector
Firehose - for parking
● Not for the fast lane - no in-flight analytics
● Capture, transform and load:
○ Kinesis
○ S3
○ Redshift
○ Elasticsearch
● Managed service
● Producer - you write input to the delivery stream
● Buffer size in MB
● Buffer interval in seconds.
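The two buffering hints interact simply: Firehose delivers a batch when either the size hint or the interval hint is reached first. A one-line sketch of that rule (illustrative only, not the service's internals):

```python
def should_flush(buffered_mb: float, elapsed_s: float,
                 size_hint_mb: float, interval_hint_s: float) -> bool:
    """Firehose-style buffering: deliver when whichever buffering hint
    (size in MB or interval in seconds) is reached first."""
    return buffered_mb >= size_hint_mb or elapsed_s >= interval_hint_s
```

This is why Firehose latency is bounded below by the buffer interval - a trickle of small records still waits out the interval hint.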
Comparison of Kinesis products
● Stream
○ Sub 1 sec processing latency
○ Choice of stream processor (generic)
○ For smaller events
● Firehose
○ Zero admin
○ 4 built-in targets (Redshift, S3, Elasticsearch, etc.)
○ Latency 60 sec minimum.
○ For larger “events”
DynamoDB
● Fully managed NoSQL document / key-value store
○ Tables have no fixed schema
● High performance
○ Single-digit millisecond latency
○ Runs on solid state drives
● Durable
○ Multi-AZ
○ Fault tolerant; replicated across 3 AZs.
Durability
● Read
○ Eventually consistent
○ Strongly consistent
● Write
○ Quorum ack
○ Replication factor of 3 - always; we can't change it.
○ Persistence to disk.
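The eventual/strong read split above follows standard quorum reasoning: with N replicas, a write quorum W and a read quorum R, reads are guaranteed to see the latest acknowledged write iff R + W > N (a generic quorum sketch, not DynamoDB's exact internal protocol):

```python
def read_is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Read and write quorums overlap - so a read sees the latest
    acknowledged write - exactly when R + W > N."""
    return r + w > n

# With N=3 and a write quorum of 2, reading 2 replicas is strongly
# consistent (2 + 2 > 3); reading a single replica (R=1) is only
# eventually consistent (1 + 2 = 3).
```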
Indexing and partitioning
● Indexing
○ LSI - local secondary index; a kind of alternate range key
○ GSI - global secondary index - “pivot charts” for your table; a kind of projection (as in Vertica)
● Partitioning
○ Automatic
○ Hash key spreads data across partitions
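How a hash key spreads items can be shown with a toy routing function (md5 is used here only for illustration; DynamoDB's internal hash function is not public):

```python
import hashlib

def partition_for(hash_key: str, num_partitions: int) -> int:
    """Route an item to a partition by hashing its partition key."""
    digest = hashlib.md5(hash_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Items with the same key always land on the same partition;
# distinct keys spread evenly across partitions.
```

This is also why a hot partition key (many items behind one key) defeats automatic partitioning - every request hashes to the same partition.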
DynamoDB Items and Attributes
● Partition key
● Sort key (optional)
● LSI
● Attributes
AWS Titan(on DynamoDB) - Graph DB
● Vertex - nodes
● Edge - relationship b/w nodes.
● Good when you need to investigate a more-than-2-layer relationship (a join of 4 tables)
● Based on TinkerPop (open source)
● Full-text search with Lucene, Solr or Elasticsearch
● HA using multi-master replication (Cassandra backend)
● Scale using the DynamoDB backend.
● Use cases: cyber, social networks, risk management
Elasticache: MemCache / Redis
● Sub-millisecond cache.
● In memory
● No disk I/O when querying
● High throughput
● High availability
● Fully managed
Redshift
● Petabyte-scale database for analytics
○ Columnar
○ MPP
● Complex SQL
RDS
● Flavours
○ MySQL
○ Aurora
○ PostgreSQL
○ Oracle
○ MSSQL
● Multi AZ, HA
● Managed services
Data Processing
● Batch
● Ad hoc queries
● Message
● Stream
● Machine learning
● (By the way: use Parquet, not ORC - it is more commonly used in the ecosystem)
Athena
● Presto
● In memory
● Hive metastore for DDL functionality
○ Complex data types
○ Multiple formats
○ Partitions
EMR
● Pig and Hive can work on top of Spark, but not yet in EMR
● Tez is not working; Hortonworks gave up on it.
● Can run Presto on Hive
Best practice
● Bzip2 - splittable compression
○ You don't have to read the whole file; you can decompress just one block of it and work in parallel
● Snappy - very fast encoding/decoding time, but does not compress as well.
● Partitions
● Ephemeral EMR
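The splittability point can be demonstrated with the stdlib bz2 module: if each block is compressed independently, a worker can decompress just its own block without reading the rest - which is what lets a cluster process one file in parallel (a simplified sketch of block-level splitting, not the exact on-disk bzip2 layout):

```python
import bz2
from typing import List

def compress_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Compress each fixed-size block independently, as a splittable format allows."""
    return [bz2.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_block(blocks: List[bytes], index: int) -> bytes:
    """A worker decompresses only its own block - no need to read the others."""
    return bz2.decompress(blocks[index])
```

With gzip, by contrast, the whole stream must be decompressed from the start, so one file means one reader.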
EMR ecosystem
● Hive
● Pig
● Hue
● Spark
● Oozie
● Presto
● Ganglia
● Zookeeper
● Zeppelin
● For research - public data sets exist on AWS…
EMR Architecture
● Master node
● Core nodes - like data nodes (with storage)
● Task nodes - compute only (not like regular Hadoop; they extend compute)
● Default replication factor
○ 1-3 nodes ⇒ replication factor 1
○ 4-9 nodes ⇒ replication factor 2
○ 10+ nodes ⇒ replication factor 3
○ Not relevant for external tables
● Does not have a standby master node
● Best for transient clusters (go up and down every night)
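The default replication-factor rule above is easy to encode directly from the slide's thresholds:

```python
def default_replication_factor(core_nodes: int) -> int:
    """EMR's default HDFS replication factor by cluster size (per the rule above)."""
    if core_nodes <= 3:
        return 1
    if core_nodes <= 9:
        return 2
    return 3
```

Useful when sizing a transient cluster: with 4+ core nodes your effective HDFS capacity is halved, and with 10+ it is a third of the raw disk.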
EMR
● Use Scala
● If you use Spark SQL, even in Python it is OK.
● For code - Python is full of bugs, but connects well to R
● Scala is better - but does not connect easily to R for data science
● Use CloudFormation when ready to deploy fast.
● Check instance I ,
● Dense-storage instances are a good architecture
● Use spot instances - for the task nodes.
● Use tags
● Constantly upgrade the AMI version
● Don't use Tez
● Make sure you choose network-optimized instances
● Resizing a cluster is not recommended
● Bootstrap actions - to automate the cluster upon provisioning
● Steps - to automate steps on a running cluster
● Use RDS to share the Hive metastore (the metastore is MySQL based)
● Use R, Kafka and Impala via bootstrap actions, and many others
CPU features for instances
Sending work to EMR
● Steps -
○ Can be added while the cluster is running
○ Can be added from the UI / CLI
○ FIFO scheduler by default
● EMR API
● Ganglia: jvm.JvmMetrics.MemHeapUsedM
Landscape
Hive
● SQL over Hadoop.
● Engines: Spark, Tez, MR
● JDBC / ODBC
● Not good when you need to shuffle.
● Connects well with DynamoDB
● SerDes: JSON, Parquet, regex etc.
HBase
● NoSQL, like Cassandra
● Runs below YARN
● An HBase agent on each data node, like Impala.
● Writes to HDFS
● Multi-level index.
● (AWS avoids talking about it because of DynamoDB; used only when you want to save
money)
● Driver from Hive (work from Hive on top of HBase)
Presto
● Like Hive, from Facebook.
● Not always good for a join of 2 large tables.
● Limited by memory
● Not fault tolerant like Hive is.
● Optimized for ad hoc queries
Pig
● Distributed shell scripting
● Generates SQL-like operations.
● Engines: MR, Tez
● S3, DynamoDB access
● Use case: for data scientists who don't know SQL, for systems people, for those
who want to avoid Java/Scala
● A fair fight compared to Hive in terms of performance only
● Good for unstructured-file ETL: file to file, and use Sqoop.
Spark
Mahout
● Machine learning library for Hadoop - not used
● Spark MLlib is used instead.
R
● Open source package for statistical computing.
● Works with EMR
● “Matlab” equivalent
● Works with Spark
● Not for developers :) for statisticians
● R is single threaded - use SparkR to distribute. Not everything works perfectly.
Apache Zeppelin
● Notebook - visualizer
● Built-in Spark integration
● Interactive data analytics
● Easy collaboration.
● Uses SQL
● Works on top of Hive
● Inside EMR.
● Gives more feedback to let you know where you are
Hue
● Hadoop User Experience
● Logs and failures in real time.
● Multiple users
● Native access to S3.
● File browser for HDFS.
● Manipulate the metastore
● Job browser
● Query editor
● HBase browser
● Sqoop editor, Oozie editor, Pig editor
Spark
● In memory
● 10x to 100x faster
● Good optimizer for distribution
● Rich API
● Spark SQL
● Spark Streaming
● Spark ML (ML lib)
● Spark GraphX (DB graphs)
● SparkR
Spark
● RDD
○ An array (data set)
○ Read-only distributed objects cached in memory across the cluster
○ Allows apps to keep a working set in memory for reuse
○ Fault tolerant
○ Ops:
■ Transformations: map / filter / groupBy / join
■ Actions: count / reduce / collect / save / persist
○ Object-oriented ops
○ High-level expressions like lambdas, functions, map
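The transformation/action split above can be mimicked in plain Python: transformations stay lazy (generators), and only an action forces the work (a hypothetical single-machine mini-RDD for illustration, not the PySpark API; unlike a real RDD, this toy is single-pass and not recomputable):

```python
class MiniRDD:
    """Toy stand-in for an RDD: lazy transformations, eager actions."""

    def __init__(self, data):
        self._data = data  # any iterable; nothing is computed yet

    # Transformations: return a new MiniRDD, evaluated lazily.
    def map(self, f):
        return MiniRDD(f(x) for x in self._data)

    def filter(self, p):
        return MiniRDD(x for x in self._data if p(x))

    # Actions: force evaluation and return a value.
    def collect(self):
        return list(self._data)

    def count(self):
        return sum(1 for _ in self._data)

# Chaining transformations builds a pipeline; collect() triggers it:
# MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
```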
Data Frames
● Dataset containing named columns
● Access by ordinal position
● Tungsten execution backend - framework for distributed memory (open
source?)
● Abstraction for selecting, filtering, aggregating and plotting structured data.
● Run SQL on it.
Streaming
● Near real time (1 sec latency) - like batches in 1-sec windows
● Same spark as before
● Streaming jobs with API
Spark ML
● Classification
● Regression
● Collaborative filtering
● Clustering
● Decomposition
● Code: Java, Scala, Python, SparkR
Spark GraphX
● Works with graphs for parallel computations
● Access to a library of algorithms:
○ Page rank
○ Connected components
○ Label propagation
○ SVD++
○ Strongly connected components
Hive on Spark
● Will replace Tez.
Spark flavours
● Own (standalone) cluster
● With YARN
● With Mesos
Downside
● Compute intensive
● The performance gain over MapReduce is not guaranteed.
● Stream processing is actually batch with a very small window.
Redshift
● OLAP, not OLTP → analytics, not transactions
● Fully SQL
● Fully ACID
● No indexing
● Fully managed
● Petabyte Scale
● MPP
● Can create a slow queue for long-lasting queries.
● DO NOT USE for transformations.
EMR vs Redshift
● How much data is loaded and unloaded?
● Which operations need to be performed?
● Recycling data? → EMR
● History to be analyzed again and again? → EMR
● Where does the data need to end up? BI?
● Use Spectrum in some use cases.
● Raw data? S3.
Hive vs. Redshift
● Amount of concurrency? Low → Hive, high → Redshift
● Access to customers? Redshift
● Transformations, unstructured, batch, ETL → Hive
● Peta scale? → Redshift
● Complex joins → Redshift
Presto vs. Redshift
● Presto is not a true DW, but can be used as one.
● Requires S3 or HDFS
● Netflix uses Presto for analytics.
Redshift
● Leader node
○ Metadata
○ Execution plan
○ SQL endpoint
● Data nodes
● Distribution key
● Sort key
● Normalize…
● Don't be afraid to duplicate data with a different sort key
Redshift Spectrum
● External tables on S3
● Additional compute resources for the Redshift cluster.
● Not good for all use cases
Cost consideration
● Region
● Data out
● Spot instances
○ 50%-80% cost reduction.
○ Limit your bid
○ Works well with EMR; use spot instances mostly for task nodes. For dev - use spot.
○ May be killed in the middle :)
● Reserved instances
Kinesis pricing
● Streams
○ Per shard-hour
○ Per PUT payload unit (25 KB)
● Firehose
○ Per volume
○ 5 KB increments
● Analytics
○ Pay per container: 1 CPU, 4 GB = 11 cents
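The PUT payload unit billing above rounds each record up to a multiple of 25 KB, which is why many tiny records are proportionally expensive. A quick sketch of that math (using the slide's 25 KB unit; check current AWS pricing for actual rates):

```python
import math

PAYLOAD_UNIT_KB = 25  # one PUT payload unit covers up to 25 KB

def put_payload_units(record_sizes_kb):
    """Each record is billed in 25 KB payload units, rounded up per record."""
    return sum(math.ceil(size / PAYLOAD_UNIT_KB) for size in record_sizes_kb)

# A 1 KB record and a 26 KB record together cost 1 + 2 = 3 payload units,
# even though the total data is only 27 KB.
```

Batching small records into fewer, fuller PUTs therefore reduces the payload-unit count.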
Dynamo
● Most expensive: pay for throughput
○ 1 KB write units
○ 4 KB read units
○ Eventually consistent reads are cheaper
● Be careful… from $400 to $10,000 simply because you change the block size.
● You pay for storage: the first 25 GB are free, the rest is 25 cents per GB
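The throughput billing above (1 KB writes, 4 KB reads, cheaper eventually consistent reads) translates into capacity-unit math like this - a sketch of the published capacity-unit rules, and the reason a small change in item size can multiply the bill:

```python
import math

def write_capacity_units(item_kb: float) -> int:
    """One WCU per 1 KB of item size, rounded up per item."""
    return math.ceil(item_kb / 1.0)

def read_capacity_units(item_kb: float, eventually_consistent: bool = False) -> float:
    """One RCU per 4 KB read; an eventually consistent read costs half."""
    units = math.ceil(item_kb / 4.0)
    return units / 2 if eventually_consistent else units

# Growing an item from 1.0 KB to 1.1 KB doubles its write cost (1 → 2 WCU).
```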
Optimize cost
● Alerts for cost mistakes -
○ Unused machines etc.
○ At 95% of expected cost → something is wrong…
● No need to buy RIs for 3 years; 1 year is better.
Visualizing - QuickSight
● Cheap
● Connects to everything
○ S3
○ DynamoDB
○ Redshift
● SPICE
● Use storyboards to create slides of graphs, like PPT
Tableau
● Connects to Redshift
● Good look and feel
● ~$1000 per user
● Not the best in terms of performance. QuickSight is faster.
Other visualizers
● Tibco Spotfire
○ Has many connectors
○ Has a recommendations feature
● Jaspersoft
○ Limited
● ZoomData
● Hunk -
○ Uses EMR and S3
○ Schema on the fly
○ Available in the marketplace
○ Expensive.
○ Non-SQL; has its own language.
Visualize - which DB to choose?
● Hive is not recommended
● Presto is a bit faster.
● Athena is OK
● Impala is better (has an agent per machine, caching)
● Redshift is best.
Orchestration
● Oozie
○ Open source workflow engine
■ Workflow: a graph of actions
■ Coordinator: schedules jobs
○ Supports: Hive, Sqoop, Spark etc.
● Data Pipeline
○ Move data from on-prem to the cloud
○ Distributed
○ Integrates well with S3, DynamoDB, RDS, EMR, EC2, Redshift
○ Like ETL: input, data manipulation, output
○ Not trivial, but nicer than Oozie
● Others: Airflow, Knime, Luigi, Azkaban
Security
● Shared security model
○ Customer:
■ OS, platform, identity, access management, roles, permissions
■ Network
■ Firewall
○ AWS:
■ Compute, storage, databases, networking, LB
■ Regions, AZs, edge locations
■ Compliance
EMR secured
● IAM
● VPC
● Private subnet
● VPC endpoint to S3
● MFA for login
● STS + SAML - token-based login. A complex solution
● Kerberos authentication of Hadoop nodes
● Encryption
○ SSH to master
○ TLS between nodes
IAM
● User
● Best practice:
○ MFA
○ Don't use the ROOT account
● Group
● Role
● Policy best practice
○ Allow X
○ Disallow all the rest
● Identity federation via LDAP
Kinesis Security best practice
● Create an IAM admin user for administration
● Create an IAM entity for resharding the stream
● Create an IAM entity for producers to write
● Create an IAM entity for consumers to read.
● Allow specific source IPs
● Enforce the aws:SecureTransport condition key for every API call.
● Use temporary credentials: IAM roles
Dynamo Security
● Access
● Policies
● Roles
● Database-level access control - row level (item) / column level (attribute)
● STS for web identity federation.
● Limit the amount of rows a user can see
● Use SSL
● All requests can be signed via SHA-256
● PCI, SOC 3, HIPAA, 27001 etc.
● CloudTrail - for audit logs
Redshift Security
● SSL in transit
● Encryption
○ KMS AES-256
○ HSM encryption (hardware)
● VPC
● CloudTrail
● All the usual regulatory certifications.
● Security groups
Big Data patterns
● Interactive query → mostly EMR, Athena
● Batch processing → reporting → Redshift + EMR
● Stream processing → Kinesis, Kinesis Client Library, Lambda, EMR
● Real-time prediction → mobile, DynamoDB → Lambda + ML
● Batch prediction → ML and Redshift
● Long-running cluster → S3 → EMR
● Log aggregation → S3 → EMR → Redshift
Stay in touch...
● Omid Vahdaty
● +972-54-2384178

Seizure stage detection of epileptic seizure using convolutional neural networks
 
"United Nations Park" Site Visit Report.
"United Nations Park" Site  Visit Report."United Nations Park" Site  Visit Report.
"United Nations Park" Site Visit Report.
 
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
 

Introduction to AWS Big Data

  • 1. Introduction to AWS Big Data Omid Vahdaty, Big Data Ninja
  • 2. When the data outgrows your ability to process it ● Volume ● Velocity ● Variety
  • 3. EMR ● Basic: the simplest of all the Hadoop distributions. ● Not as feature-rich as Cloudera.
  • 5. Store ● S3 ● Kinesis ● RDS ● DynamoDB ● CloudSearch ● IoT
  • 6. Process + Analyze ● Lambda ● EMR ● Redshift ● Machine Learning ● Elasticsearch ● Data Pipeline ● Athena
  • 7. Visualize ● QuickSight ● Elasticsearch Service
  • 8. History of tools ● HDFS - from Google ● Cassandra - from Facebook; columnar-store NoSQL, materialized views, secondary indexes ● Kafka - from LinkedIn ● HBase - NoSQL, Hadoop ecosystem
  • 9. EMR Hadoop ecosystem ● Spark - the recommended option; can do all of the below… ● Hive ● Oozie ● Mahout - machine learning library (MLlib is better; it runs on Spark) ● Presto - better than Hive? More generic than Hive. ● Pig - scripting for big data. ● Impala - from Cloudera, and part of EMR.
  • 10. ETL Tools ● Attunity ● Splunk ● Semarchy ● Informatica ● Tibco ● Clarit
  • 11. Architecture ● Decouple: ○ Store ○ Process ○ Store ○ Process ○ insight... ● Rule of thumb: 3 technologies max in a data center, 7 max in the cloud ○ Don't use more, because of the maintenance burden
  • 12. Architecture considerations ● Unstructured? Structured? Semi-structured? ● Latency? ● Throughput? ● Concurrency? ● Access patterns? ● PaaS? Max 7 technologies ● IaaS? Max 4 technologies
  • 13. EcoSystem ● Redshift = analytics ● Aurora = OLTP ● DynamoDB = NoSQL, like MongoDB ● Lambda ● SQS ● CloudSearch ● Elasticsearch ● Data Pipeline ● Beanstalk - “I have a JAR, install it for me…” ● AWS Machine Learning
  • 14. Data ingestion architecture challenges ● Durability ● HA ● Ingestion types: ○ Batch ○ Stream ● Transactions: OLTP, NoSQL
  • 15. Ingestion options ● Kinesis ● Flume ● Kafka ● S3DistCp copies: ○ S3 to HDFS ○ S3 to S3 ○ Cross-account ○ Supports compression
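The S3DistCp bullet above can be driven from code. Below is a minimal boto3 sketch that builds an s3-dist-cp step and submits it to a running EMR cluster; the bucket paths and cluster id are placeholders, and the gzip output codec is an illustrative assumption.

```python
def s3distcp_step(src, dest, compress=False):
    """Build an EMR step definition that runs s3-dist-cp (pure data, no AWS call)."""
    args = ["s3-dist-cp", "--src", src, "--dest", dest]
    if compress:
        args += ["--outputCodec", "gz"]  # s3-dist-cp supports output compression
    return {
        "Name": f"s3-dist-cp {src} -> {dest}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

def submit(cluster_id, step):
    """Submit the step to a running cluster (requires AWS credentials)."""
    import boto3  # imported lazily so the helper above stays testable offline
    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])

# Hypothetical paths: copy compressed logs from S3 into the cluster's HDFS.
step = s3distcp_step("s3://my-bucket/logs/", "hdfs:///input/", compress=True)
```

The same step dictionary also works in the `Steps` list of `run_job_flow` when launching a transient cluster.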
  • 16. Transfer ● VPN ● Direct Connect ● S3 multipart upload ● Snowball ● IoT
  • 17. Streaming ● Streaming ● Batch ● Collect ○ Kinesis ○ DynamoDB Streams ○ SQS (pull) ○ SNS (push) ○ Kafka (most recommended)
  • 19. Streaming ● Low latency ● Message delivery guarantees ● Lambda-architecture implementation ● State management ● Time- or count-based windowing support ● Fault tolerant
  • 21. Stream collection options ● Kinesis Client Library ● AWS Lambda ● EMR ● Third party: ○ Spark Streaming (minimum latency 1 sec) - near real time, with lots of libraries ○ Storm - the most real-time (sub-millisecond), Java-code based ○ Flink (similar to Spark)
  • 22. Kinesis ● Stream - collect at the source and process in near real time ○ Near real time ○ High throughput ○ Low cost ○ Easy administration - set the desired level of capacity ○ Delivery to: S3, Redshift, DynamoDB ○ Ingress 1 MB/s, egress 2 MB/s, up to 1,000 transactions per second (per shard) ● Analytics - in-flight analytics ● Firehose - parks your data at the destination
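The per-shard limits quoted above (1 MB/s ingress, 2 MB/s egress, 1,000 records/s) translate directly into a shard-count estimate. A small sketch, using only the numbers from the slide:

```python
import math

# Per-shard limits as stated on the slide: 1 MB/s in, 2 MB/s out,
# up to 1,000 PUT records per second.
INGRESS_MBS, EGRESS_MBS, RECORDS_PER_SEC = 1.0, 2.0, 1000

def shards_needed(in_mb_per_sec, out_mb_per_sec, records_per_sec):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        math.ceil(in_mb_per_sec / INGRESS_MBS),
        math.ceil(out_mb_per_sec / EGRESS_MBS),
        math.ceil(records_per_sec / RECORDS_PER_SEC),
        1,  # a stream always has at least one shard
    )

# e.g. 5 MB/s in, 4 MB/s out, 3,500 records/s -> bounded by ingress: 5 shards
```

"Set the desired level of capacity" in practice means picking this shard count up front and resharding as traffic grows.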
  • 24. Kinesis Analytics example:
    CREATE STREAM "INTERMEDIATE_STREAM" (
      hostname VARCHAR(1024),
      logname VARCHAR(1024),
      username VARCHAR(1024),
      requesttime VARCHAR(1024),
      request VARCHAR(1024),
      status VARCHAR(32),
      responsesize VARCHAR(32)
    );
    -- Data Pump: Take incoming data from SOURCE_SQL_STREAM_001 and insert into INTERMEDIATE_STREAM
  • 25. KCL ● Read from the stream using the Get APIs ● Build applications with the KCL ● Leverage the Kinesis spout for Storm ● Leverage the EMR connector
  • 26. Firehose - for parking ● Not for the fast lane - no in-flight analytics ● Capture, transform, and load: ○ Kinesis ○ S3 ○ Redshift ○ Elasticsearch ● Managed service ● Producer - you write input to a delivery stream ● Buffer size in MB ● Buffer interval in seconds
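A sketch of the producer side described above: Firehose records are raw bytes, and newline-delimiting the JSON keeps the objects it parks in S3 line-parseable. The stream name is hypothetical, and the actual delivery call needs AWS credentials.

```python
import json

def make_record(event: dict) -> dict:
    """Firehose wants a bytes payload; newline-delimit JSON for S3 parseability."""
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

def deliver(stream_name, events):
    """Send events to a Firehose delivery stream (requires AWS credentials)."""
    import boto3  # lazy import: the pure helper above works offline
    firehose = boto3.client("firehose")
    for start in range(0, len(events), 500):  # PutRecordBatch caps at 500 records
        batch = [make_record(e) for e in events[start:start + 500]]
        firehose.put_record_batch(DeliveryStreamName=stream_name, Records=batch)
```

Firehose then flushes to the destination whenever the buffer-size (MB) or buffer-interval (seconds) threshold from the slide is hit, whichever comes first.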
  • 27. Comparison of Kinesis products ● Stream ○ Sub-1-sec processing latency ○ Choice of stream processor (generic) ○ For smaller events ● Firehose ○ Zero admin ○ 4 targets built in (Redshift, S3, Elasticsearch, etc.) ○ Latency 60 sec minimum ○ For larger “events”
  • 28. DynamoDB ● Fully managed NoSQL document / key-value store ○ Tables have no fixed schema ● High performance ○ Single-digit-millisecond latency ○ Runs on solid-state drives ● Durable ○ Multi-AZ ○ Fault tolerant, replicated across 3 AZs
  • 29. Durability ● Read ○ Eventually consistent ○ Strongly consistent ● Write ○ Quorum ack ○ 3 replicas - always; we can’t change it ○ Persistence to disk
  • 30. Indexing and partitioning ● Indexing ○ LSI - local secondary index, a kind of alternate range key ○ GSI - global secondary index - “pivot charts” for your table, a kind of projection (as in Vertica) ● Partitioning ○ Automatic ○ The hash key spreads data across partitions
  • 31. DynamoDB items and attributes ● Partition key ● Sort key (optional) ● LSI ● Attributes
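A sketch of how the partition key, sort key, and LSI above fit together in one table definition; the table and attribute names are made up for illustration. Note that an LSI shares the table's partition key (only the sort key changes) and must be declared at table-creation time.

```python
# Table definition only (pure data) -- submit with boto3's create_table when ready.
TABLE_SPEC = dict(
    TableName="GameScores",  # hypothetical table
    KeySchema=[
        {"AttributeName": "UserId", "KeyType": "HASH"},      # partition key
        {"AttributeName": "GameTitle", "KeyType": "RANGE"},  # sort key (optional)
    ],
    AttributeDefinitions=[
        {"AttributeName": "UserId", "AttributeType": "S"},
        {"AttributeName": "GameTitle", "AttributeType": "S"},
        {"AttributeName": "TopScore", "AttributeType": "N"},
    ],
    LocalSecondaryIndexes=[{  # LSI: alternate sort key within the same partition
        "IndexName": "ByTopScore",
        "KeySchema": [
            {"AttributeName": "UserId", "KeyType": "HASH"},   # same partition key
            {"AttributeName": "TopScore", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)

def create(spec):
    """Create the table (requires AWS credentials)."""
    import boto3  # lazy import; the spec above is testable offline
    boto3.client("dynamodb").create_table(**spec)
```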
  • 32. AWS Titan (on DynamoDB) - graph DB ● Vertex - node ● Edge - relationship between nodes ● Good when you need to investigate relationships more than 2 layers deep (a join of 4 tables) ● Based on TinkerPop (open source) ● Full-text search with Lucene, Solr, or Elasticsearch ● HA using multi-master replication (Cassandra-backed) ● Scale using the DynamoDB backend ● Use cases: cyber, social networks, risk management
  • 33. ElastiCache: Memcached / Redis ● Sub-millisecond cache ● In memory ● No disk I/O when querying ● High throughput ● High availability ● Fully managed
  • 34. Redshift ● Petabyte-scale database for analytics ○ Columnar ○ MPP ● Complex SQL
  • 35. RDS ● Flavors ○ MySQL ○ Aurora ○ PostgreSQL ○ Oracle ○ SQL Server ● Multi-AZ, HA ● Managed service
  • 36. Data processing ● Batch ● Ad hoc queries ● Message ● Stream ● Machine learning ● (By the way: use Parquet, not ORC - it is more commonly used in the ecosystem)
  • 37. Athena ● Presto ● In memory ● Hive metastore for DDL functionality ○ Complex data types ○ Multiple formats ○ Partitions
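Since Athena is driven through an API rather than a cluster, a minimal sketch looks like this. The database, table, and output bucket are placeholders, and filtering on the partition column (assumed here to be `dt`) is what keeps the amount of data scanned - and therefore the cost - down.

```python
def pruned_count(table, partition_col, value):
    """Build a query that prunes partitions (pure string helper, testable offline)."""
    return f"SELECT count(*) AS n FROM {table} WHERE {partition_col} = '{value}'"

def run_athena(sql, database, output_s3):
    """Submit the query (requires AWS credentials); results land in output_s3 as CSV."""
    import boto3  # lazy import so the helper above runs offline
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]

sql = pruned_count("access_logs", "dt", "2017-01-01")
# run_athena(sql, "weblogs", "s3://my-athena-results/")  # hypothetical names
```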
  • 38. EMR ● Pig and Hive can work on top of Spark, but not yet in EMR ● Tez is not working; Hortonworks gave up on it ● Can run Presto on Hive
  • 39. Best practice ● Bzip2 - splittable compression ○ You don't have to read the whole file; you can decompress a single block and work in parallel ● Snappy - very fast encoding/decoding, but does not compress well ● Partitions ● Ephemeral EMR
  • 40. EMR ecosystem ● Hive ● Pig ● Hue ● Spark ● Oozie ● Presto ● Ganglia ● ZooKeeper ● Zeppelin ● For research - public data sets exist on AWS...
  • 41. EMR architecture ● Master node ● Core nodes - like data nodes (with storage) ● Task nodes - (unlike regular Hadoop, these extend compute only) ● Default replication factor ○ 1-3 nodes ⇒ factor 1 ○ 4-9 nodes ⇒ factor 2 ○ 10+ nodes ⇒ factor 3 ○ Not relevant for external tables ● Does not have a standby master node ● Best for transient clusters (go up and down every night)
  • 42. EMR ● Use Scala ● If you use Spark SQL, even in Python it is OK ● For code - Python is full of bugs, but connects well to R ● Scala is better - but does not connect easily to data-science R ● Use CloudFormation when ready to deploy fast ● Check instance I , ● Dense-storage instances are good architecture ● Use spot instances - for the task nodes ● Use tags ● Constantly upgrade the AMI version ● Don't use Tez ● Make sure you choose network-optimized instances ● Resizing the cluster is not recommended ● Bootstrap actions to automate the cluster upon provisioning ● Steps to automate work on a running cluster ● Use RDS to share the Hive metastore (the metastore is MySQL-based) ● Use R, Kafka, and Impala via bootstrap actions, among many others
  • 43. CPU features for instances
  • 44. Sending work to EMR ● Steps ○ Can be added while the cluster is running ○ Can be added from the UI / CLI ○ FIFO scheduler by default ● EMR API ● Ganglia: jvm.JvmMetrics.MemHeapUsedM
  • 46. Hive ● SQL over Hadoop ● Engines: Spark, Tez, MR ● JDBC / ODBC ● Not good when you need to shuffle ● Connects well with DynamoDB ● SerDes: JSON, Parquet, regex, etc.
  • 47. HBase ● NoSQL, like Cassandra ● Sits below YARN ● An HBase agent runs on each data node, like Impala ● Writes to HDFS ● Multi-level index ● (AWS avoids talking about it because of DynamoDB; use it only when you want to save money) ● Driver for Hive (work from Hive on top of HBase)
  • 48. Presto ● Like Hive, from Facebook ● Not always good for a join of 2 large tables ● Limited by memory ● Not fault tolerant like Hive ● Optimized for ad hoc queries
  • 49. Pig ● Distributed shell scripting ● Generates SQL-like operations ● Engines: MR, Tez ● S3, DynamoDB access ● Use case: for data scientists who don't know SQL, for systems people, for those who want to avoid Java/Scala ● A fair fight compared to Hive in terms of performance only ● Good for unstructured-file ETL: file to file, then use Sqoop
  • 50. Spark
  • 51. Mahout ● Machine learning library for Hadoop - not used anymore ● Spark MLlib is used instead
  • 52. R ● Open-source package for statistical computing ● Works with EMR ● A “Matlab” equivalent ● Works with Spark ● Not for developers :) - for statisticians ● R is single-threaded - use SparkR to distribute (not everything works perfectly)
  • 53. Apache Zeppelin ● Notebook - visualizer ● Built-in Spark integration ● Interactive data analytics ● Easy collaboration ● Uses SQL ● Works on top of Hive ● Inside EMR ● Gives more feedback to let you know where you are
  • 54. Hue ● Hadoop User Experience ● Logs in real time, and failures ● Multiple users ● Native access to S3 ● File browser for HDFS ● Manipulate the metastore ● Job browser ● Query editor ● HBase browser ● Sqoop editor, Oozie editor, Pig editor
  • 55. Spark ● In memory ● 10x to 100x faster ● Good optimizer for distribution ● Rich API ● Spark SQL ● Spark Streaming ● Spark ML (MLlib) ● Spark GraphX (graph DBs) ● SparkR
  • 56. Spark ● RDD ○ An array (data set) ○ Read-only distributed objects cached in memory across the cluster ○ Allows apps to keep a working set in memory for reuse ○ Fault tolerant ○ Ops: ■ Transformations: map / filter / groupBy / join ■ Actions: count / reduce / collect / save / persist ○ Object-oriented ops ○ High-level expressions like lambdas, functions, map
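The transformation/action split above is the key idea: transformations only build a lazy plan over the RDD, and nothing executes until an action fires. A PySpark word-count sketch - the input path and app name are placeholders, and the commented driver setup at the end needs a Spark installation:

```python
def tokenize(line):
    """Pure helper used inside the RDD pipeline (testable without Spark)."""
    return [w for w in line.lower().split() if w]

def word_count(sc, path):
    """Transformations build a lazy plan; the action (collect) triggers execution."""
    return (sc.textFile(path)
              .flatMap(tokenize)                # transformation
              .map(lambda w: (w, 1))            # transformation
              .reduceByKey(lambda a, b: a + b)  # transformation
              .collect())                       # action: runs the whole pipeline

# Typical driver setup (needs a Spark installation):
# from pyspark import SparkContext
# sc = SparkContext(appName="wordcount")
# print(word_count(sc, "s3://my-bucket/input/"))
```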
  • 57. Data Frames ● Data sets containing named columns ● Access by ordinal position ● Tungsten execution backend - framework for distributed memory (open source?) ● Abstraction for selecting, filtering, aggregating, and plotting structured data ● Run SQL on it
  • 58. Streaming ● Near real time (1 sec latency) - like batches in 1-sec windows ● The same Spark as before ● Streaming jobs with an API
  • 59. Spark ML ● Classification ● Regression ● Collaborative filtering ● Clustering ● Decomposition ● Code: Java, Scala, Python, SparkR
  • 60. Spark GraphX ● Works with graphs for parallel computations ● Access to a library of algorithms ○ PageRank ○ Connected components ○ Label propagation ○ SVD++ ○ Strongly connected components
  • 61. Hive on Spark ● Will replace Tez
  • 62. Spark flavors ● Own cluster (standalone) ● With YARN ● With Mesos
  • 63. Downsides ● Compute-intensive ● The performance gain over MapReduce is not guaranteed ● Stream processing is actually batch with a very small window
  • 64. Redshift ● OLAP, not OLTP → analytics, not transactions ● Full SQL ● Fully ACID ● No indexing ● Fully managed ● Petabyte scale ● MPP ● Can create a slow queue for long-running queries ● DO NOT USE for transformation
  • 65. EMR vs. Redshift ● How much data is loaded and unloaded? ● Which operations need to be performed? ● Recycling data? → EMR ● History to be analyzed again and again? → EMR ● Where does the data need to end up? BI? ● Use Spectrum in some use cases ● Raw data? → S3
  • 66. Hive vs. Redshift ● Amount of concurrency? Low → Hive, high → Redshift ● Access for customers? → Redshift ● Transformation, unstructured, batch, ETL → Hive ● Petabyte scale? → Redshift ● Complex joins → Redshift
  • 67. Presto vs. Redshift ● Presto is not a true DW, but can be used as one ● Requires S3 or HDFS ● Netflix uses Presto for analytics
  • 68. Redshift ● Leader node ○ Metadata ○ Execution plan ○ SQL endpoint ● Data nodes ● Distribution key ● Sort key ● Normalize… ● Don't be afraid to duplicate data with a different sort key
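The distribution-key and sort-key bullets above in DDL form, for a hypothetical events table; the column choices are illustrative assumptions (distribute on the common join key, sort on the common filter column):

```python
# Hypothetical events table: distribute on the join key, sort on the filter column.
DDL = """
CREATE TABLE events (
    user_id    BIGINT,
    event_time TIMESTAMP,
    event_type VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY (user_id)     -- co-locate rows that join on user_id (avoids redistribution)
SORTKEY (event_time); -- range-restricted scans on time predicates
"""

def run(ddl, conn):
    """Execute against Redshift through any DB-API connection (e.g. psycopg2)."""
    with conn.cursor() as cur:
        cur.execute(ddl)
    conn.commit()
```

Duplicating a table with a different SORTKEY, as the slide suggests, is just a second CREATE TABLE plus an INSERT ... SELECT.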
  • 69. Redshift Spectrum ● External tables on S3 ● Additional compute resources for the Redshift cluster ● Not good for all use cases
  • 70. Cost considerations ● Region ● Data out ● Spot instances ○ 50%-80% cost reduction ○ Limit your bid ○ Work well with EMR; use spot instances mostly for task nodes ○ For dev - use spot ○ May be killed in the middle :) ● Reserved instances
  • 71. Kinesis pricing ● Streams ○ Per shard-hour ○ Per PUT payload unit (25 KB) ● Firehose ○ Volume ○ 5 KB increments ● Analytics ○ Pay per container: 1 CPU, 4 GB = 11 cents
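The Streams pricing dimensions above (shard-hours plus 25 KB PUT payload units) make a back-of-the-envelope estimate easy. The unit prices below are illustrative placeholders only - check the current AWS pricing page for your region:

```python
import math

# Illustrative unit prices only (assumptions; they vary by region and over time).
SHARD_HOUR_USD = 0.015
PUT_UNIT_USD_PER_MILLION = 0.014  # one PUT payload unit = a 25 KB chunk

def monthly_stream_cost(shards, records_per_sec, avg_record_kb, hours=730):
    """Shard-hours plus PUT payload units, per the two Streams pricing dimensions."""
    put_units_per_record = math.ceil(avg_record_kb / 25)  # 25 KB per payload unit
    puts_per_month = records_per_sec * 3600 * hours * put_units_per_record
    return (shards * hours * SHARD_HOUR_USD
            + puts_per_month / 1e6 * PUT_UNIT_USD_PER_MILLION)

# e.g. 2 shards, 100 records/s of 5 KB each -> shard-hours dominate the bill
```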
  • 72. DynamoDB ● Most expensive: you pay for throughput ○ 1 KB write units ○ 4 KB read units ○ Eventually consistent reads are cheaper ● Be careful… from $400 to $10,000 simply because you changed the block size ● You pay for storage: the first 25 GB are free, the rest is 25 cents per GB
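The 1 KB / 4 KB units above map to capacity-unit arithmetic like this - a sketch of the sizing rules as stated on the slide, with eventually consistent reads costing half:

```python
import math

def write_units(item_kb, writes_per_sec):
    """Writes are billed in 1 KB units, rounded up per item."""
    return math.ceil(item_kb / 1) * writes_per_sec

def read_units(item_kb, reads_per_sec, eventually_consistent=False):
    """Reads are billed in 4 KB units; eventually consistent reads cost half."""
    units = math.ceil(item_kb / 4) * reads_per_sec
    return math.ceil(units / 2) if eventually_consistent else units

# A 3 KB item read 100x/s: 100 RCU strongly consistent, 50 eventually consistent.
# The same 3 KB item written 100x/s: 300 WCU -- item size changes matter a lot,
# which is the "from $400 to $10,000" warning above.
```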
  • 73. Optimizing cost ● Alerts for cost mistakes ○ Unused machines, etc. ○ At 95% of expected cost → something is wrong… ● No need to buy RIs for 3 years; 1 year is better
  • 74. Visualizing - QuickSight ● Cheap ● Connects to everything ○ S3 ○ DynamoDB ○ Redshift ● SPICE ● Use storyboards to create slides of graphs, like PPT
  • 75. Tableau ● Connects to Redshift ● Good look and feel ● ~$1,000 per user ● Not the best in terms of performance; QuickSight is faster
  • 76. Other visualizers ● Tibco Spotfire ○ Has many connectors ○ Has a recommendations feature ● Jaspersoft ○ Limited ● ZoomData ● Hunk ○ Uses EMR and S3 ○ Schema on the fly ○ Available in the Marketplace ○ Expensive ○ Non-SQL; has its own language
  • 77. Visualize - which DB to choose? ● Hive is not recommended ● Presto is a bit faster ● Athena is OK ● Impala is better (has an agent per machine, caching) ● Redshift is best
  • 78. Orchestration ● Oozie ○ Open-source workflow engine ■ Workflow: a graph of actions ■ Coordinator: schedules jobs ○ Supports: Hive, Sqoop, Spark, etc. ● Data Pipeline ○ Moves data from on-prem to the cloud ○ Distributed ○ Integrates well with S3, DynamoDB, RDS, EMR, EC2, Redshift ○ Like ETL: input, data manipulation, output ○ Not trivial, but nicer than Oozie ● Others: Airflow, KNIME, Luigi, Azkaban
  • 79. Security ● Shared responsibility model ○ Customer: ■ OS, platform, identity and access management, roles, permissions ■ Network ■ Firewall ○ AWS: ■ Compute, storage, databases, networking, LB ■ Regions, AZs, edge locations ■ Compliance
  • 80. Securing EMR ● IAM ● VPC ● Private subnet ● VPC endpoint to S3 ● MFA for login ● STS + SAML - token-based login; a complex solution ● Kerberos authentication of Hadoop nodes ● Encryption ○ SSH to the master ○ TLS between nodes
  • 81. IAM ● Users ● Best practice: ○ MFA ○ Don't use the root account ● Groups ● Roles ● Policy best practice ○ Allow X ○ Deny all the rest ● Identity federation via LDAP
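The “allow X, deny all the rest” bullet as a concrete policy document. IAM already denies anything not explicitly allowed, so the explicit Deny here is belt-and-braces against allows granted elsewhere; the bucket name is hypothetical:

```python
import json

# "Allow X, deny everything else": IAM denies by default, and an explicit
# Deny always overrides any Allow. Bucket name below is a placeholder.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-data-bucket",
                "arn:aws:s3:::my-data-bucket/*",
            ],
        },
        {   # explicit deny for every S3 resource outside the bucket
            "Effect": "Deny",
            "Action": "s3:*",
            "NotResource": [
                "arn:aws:s3:::my-data-bucket",
                "arn:aws:s3:::my-data-bucket/*",
            ],
        },
    ],
}
policy_json = json.dumps(POLICY)  # attach via iam.put_user_policy / put_role_policy
```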
  • 82. Kinesis security best practices ● Create an IAM admin user for administration ● Create an IAM entity for resharding the stream ● Create an IAM entity for producers to write ● Create an IAM entity for consumers to read ● Allow specific source IPs ● Enforce the aws:SecureTransport condition key for every API call ● Use temporary credentials: IAM roles
  • 83. DynamoDB security ● Access ● Policies ● Roles ● Fine-grained access control - row-level (item) / column-level (attribute) ● STS for web identity federation ● Limit the rows a user can see ● Use SSL ● All requests can be signed via SHA-256 ● PCI, SOC 3, HIPAA, ISO 27001, etc. ● CloudTrail - for audit logs
  • 84. Redshift security ● SSL in transit ● Encryption ○ KMS (AES-256) ○ HSM encryption (hardware) ● VPC ● CloudTrail ● All the usual regulatory certifications ● Security groups
  • 85. Big data patterns ● Interactive query → mostly EMR, Athena ● Batch processing → reporting → Redshift + EMR ● Stream processing → Kinesis, Kinesis Client Library, Lambda, EMR ● Real-time prediction → mobile, DynamoDB → Lambda + ML ● Batch prediction → ML and Redshift ● Long-running cluster → S3 → EMR ● Log aggregation → S3 → EMR → Redshift
  • 86. Stay in touch... ● Omid Vahdaty ● +972-54-2384178