Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017

Big Data doesn’t have to just mean Hadoop any more. Big Data can be done in the cloud, using tools developed by the Cloud providers. This session will cover using Amazon AWS services to implement a Big Data application. We will compare and contrast different services from Amazon with the Hadoop equivalents.

Published in: Technology

  1. So You Don't Have an Admin Team - Doing Big Data Using Amazon's Analogs. Adam Doyle, StampedeCon 2017. Confidential and Proprietary to Daugherty Business Solutions.
  2. EIM and Analytics
     Analytics Strategy: 400+ employees strong
     Data Science
       • Predictive and Prescriptive Analytics
       • Social, Text and Sentiment Analytics
       • Natural Language Processing
       • Machine Learning, Artificial Intelligence
       • SPSS, SAS, R, IBM Watson™
     Strategy and Competency Building
       • Build the right, comprehensive solution blueprint across 12 Domains
       • Establish specific, actionable plans and ROIs
       • Protecting your investments
       • Organization, Talent, Competency
       • Processes, Methods, Techniques, Tools
       • Speed – Agile EIM Transformation
       • Governance processes
     Customer and Business Analytics
       • Customer/Buyer/Channel Segmentation
       • Persona Development, Customer Scoring (Value, Potential)
       • Attrition Modeling, Engagement and Response Modeling
       • Inventory Management, Marketing Campaigns
       • Product Design Analytics, Workforce Planning, Location Based Advertising
       • Data Monetization
     Traditional Data Warehouse and Business Intelligence
       • EDW, ODS, Data Mart and Integration
       • Master Data Management
       • Data Governance
       • Dashboards, Scorecards, Reports, Alerts
       • Multidimensional Analysis
       • Ad hoc slicing and dicing
       • Self Service Enablement
       • Cloud Migration and Agile EIM
     Digital Engagement/Analytics
       • Customer Engagement Strategies
       • Omni-channel and Integrated Marketing
       • Strategic Planning, Building and Executing Digital and Customer Engagement Solutions
     Big Data and Next Generation Technologies
       • Data Lab Development Centers
       • Data Lakes, Analytic Platforms
       • Hadoop (Cloudera, Hortonworks)
       • NoSQL / Graph DB (MongoDB, DataStax)
       • Cloud platforms (AWS, Google, Azure)
       • Spark, Sqoop, Hive, Pig, Kafka, etc.
  3. Meet Adam Doyle
     • 20-year veteran of the St. Louis IT community
     • Co-Organizer, St. Louis Hadoop User Group
     • Big Data Community Lead, Daugherty Business Solutions
     • Formerly Big Data Solution Architect at Amitech, Lead Big Data Developer at Mercy
     • Speaker at local and national Big Data conferences
  4. Problem Statement
     You are developing an Internet of Things solution for a kitchen appliance manufacturer. Essentially, you are trying to answer the eternal question: is your refrigerator running?
     (Image: http://www.routercheck.com/2014/01/27/is-your-refrigerator-running/)
  5. Let’s say you wanted to do Big Data on AWS
     • Great! You’ve got options
       – Hadoop on EC2 with a distribution
       – Hadoop on EMR with a distribution
       – Hadoop on EMR with Amazon’s Hadoop version
  6. So what would that look like?
     Diagram labels (Hadoop): API Client, Flume Client, Kafka Client, NiFi, Kafka, Spark Streaming, HBase, Hive, HDFS, Mahout, Spark MLlib, Spark SQL, SOLR
  7. EC2
     • Virtual machines in the cloud
     • Choice of many different options
       – Operating system
       – Processors
       – Memory
       – Disk sizes
     • Can be created in minutes
     • Can be created through code
     • Can be turned off when not needed to reduce costs
  8. The downsides
     • All of these options require a Hadoop administrator who can tune the installation for performance.
     • Your servers generally need to be up and running, so you pay for them even when they are not heavily utilized.
  9. Or …
     • You can use Amazon’s services to roll your own Big Data application
     (Image: http://www.writingfordesigners.com/?p=19906)
  10. Ingest
      Diagram labels (AWS): API Gateway, AWS IoT, AWS Greengrass, Lambda, Flume Client
  11. API Gateway
      • Three-step process to set up an API
        – Define the API
        – Create the client
        – Create the server
      • Wizard to help define the API
      • Connects to Lambda, DynamoDB, EC2, S3
      Hadoop-stack analog shown: API Client
  12. Lambda
      • Serverless code execution
      • No servers to provision or manage
      • Event-trigger based
      • You pay only for code execution time
      • Automatic scaling up to user-defined thresholds
      • Currently only a few languages supported (Node.js, Java, Python, and C#)
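
To make the event-trigger model on the Lambda slide concrete, here is a minimal sketch of a Python handler in the `handler(event, context)` shape that Lambda invokes. The event fields (`device_id`, `compressor_on`, `temp_c`) and the 5 °C threshold are hypothetical, invented for the refrigerator scenario; they do not come from the talk.

```python
import json

# Hypothetical threshold for a "healthy" refrigerator, in degrees Celsius.
MAX_SAFE_TEMP_C = 5.0


def lambda_handler(event, context):
    """Classify one refrigerator telemetry event.

    `event` is assumed to be a dict with the hypothetical fields
    `device_id`, `compressor_on`, and `temp_c`.
    """
    running = bool(event.get("compressor_on", False))
    too_warm = float(event.get("temp_c", 0.0)) > MAX_SAFE_TEMP_C
    return {
        "device_id": event.get("device_id", "unknown"),
        "running": running,
        # Raise an alert if the compressor is off or the box is too warm.
        "alert": (not running) or too_warm,
    }


if __name__ == "__main__":
    # Local invocation; Lambda would supply a real context object.
    sample = {"device_id": "fridge-001", "compressor_on": True, "temp_c": 3.5}
    print(json.dumps(lambda_handler(sample, None)))
```

Because the handler is a plain function, it can be exercised locally with a dict before it is deployed.
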
  13. AWS IoT
      • Device Gateway
      • Message Broker
      • Rules Engine
      • Security and Identity Service
      • Thing Registry
      • Thing Shadow
      • Thing Shadows Service
      • Integrations with other AWS components
      • Processing SDK
      • Device SDK
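
The Thing Shadow idea can be illustrated with a small sketch. A shadow document keeps a `desired` and a `reported` section under `state`, and the difference between them is what a device still has to act on. The function below computes that difference for flat documents only. The real service computes the delta itself and handles nested documents, so treat this as an illustration; the field names (`target_temp_c`, `door_alarm`) are hypothetical.

```python
def shadow_delta(shadow):
    """Return the desired settings that the device has not yet reported.

    Works on flat `desired`/`reported` dicts only; this is an
    illustrative simplification, not the service's own algorithm.
    """
    state = shadow.get("state", {})
    desired = state.get("desired", {})
    reported = state.get("reported", {})
    return {key: value for key, value in desired.items()
            if reported.get(key) != value}


if __name__ == "__main__":
    shadow = {
        "state": {
            "desired": {"target_temp_c": 3, "door_alarm": True},
            "reported": {"target_temp_c": 5, "door_alarm": True},
        }
    }
    print(shadow_delta(shadow))
```
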
  14. AWS Greengrass
      • Extends the functions of AWS IoT to intermittently connected devices
      • Devices connect to a local Greengrass core
      • Core connects to server when connection is present
      Hadoop-stack analog shown: NiFi
  15. Processing data in real-time
      Diagram labels (AWS): API Gateway, AWS IoT, Flume Client, Lambda, Kinesis, SQS
  16. Kinesis
      • Publish/subscribe-style streaming service (streams play the role of topics)
      • Consumer/publisher bandwidth can be resized dynamically
      • Records expire after 24 hours (the default retention period)
      Hadoop-stack analog shown: Kafka
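
When a Lambda function is subscribed to a Kinesis stream, it receives batches of records whose payloads arrive base64-encoded under `Records[i].kinesis.data`. The sketch below decodes such a batch, assuming each payload is a JSON object with the hypothetical fields `device_id` and `compressor_on`. `make_test_event` is a helper invented here for local testing and builds only the part of the event the handler reads.

```python
import base64
import json


def decode_kinesis_records(event):
    """Return the JSON payloads carried by a Kinesis-triggered Lambda event."""
    payloads = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))
    return payloads


def lambda_handler(event, context):
    """Report which devices in the batch say their compressor is off."""
    readings = decode_kinesis_records(event)
    stopped = [r["device_id"] for r in readings if not r["compressor_on"]]
    return {"received": len(readings), "stopped": stopped}


def make_test_event(readings):
    """Build a minimal event of the same shape (hypothetical test helper)."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(
                json.dumps(r).encode("utf-8")).decode("ascii")}}
            for r in readings
        ]
    }
```
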
  17. Simple Queue Service (SQS)
      • Queue-based service
      • Destructive reads
      Standard Queue          | FIFO Queue
      High throughput         | Limited throughput (300 TPS)
      At-Least-Once Delivery  | Exactly-Once Processing
      Best-Effort Ordering    | First-In-First-Out Delivery
      Hadoop-stack analog shown: Spark Streaming
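
At-least-once delivery on a standard queue means the same message can occasionally arrive twice, so consumers are usually written to be idempotent. The sketch below shows one simple way to do that by remembering message IDs. The `MessageId` and `Body` keys mirror the fields SQS returns for a received message. The in-memory set is an assumption made to keep the example self-contained; a real consumer would need a durable store.

```python
class IdempotentConsumer:
    """Process each message at most once, keyed by its MessageId."""

    def __init__(self):
        self.seen_ids = set()   # in-memory only; not durable across restarts
        self.processed = []     # bodies that were actually handled

    def handle(self, message):
        """Handle a message; return False if it is a duplicate delivery."""
        message_id = message["MessageId"]
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.processed.append(message["Body"])
        return True


if __name__ == "__main__":
    consumer = IdempotentConsumer()
    deliveries = [
        {"MessageId": "m1", "Body": "fridge-001 stopped"},
        {"MessageId": "m1", "Body": "fridge-001 stopped"},  # redelivery
        {"MessageId": "m2", "Body": "fridge-002 stopped"},
    ]
    print([consumer.handle(m) for m in deliveries])
```
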
  18. Data Pipeline
      • Scheduled batch operations
      • WYSIWYG editor
  19. Storing Data
      Diagram labels (AWS): API Client, Flume Client, Kafka Client, API Gateway, AWS IoT, Lambda, Kinesis, SQS, S3, RDS, DynamoDB
  20. S3
      • File storage in the cloud
      • Store file backups offsite
      • Host static websites
      • Highly available – 99.99%
      • Highly durable – 99.999999999%
      Hadoop-stack analog shown: HDFS
      • Versioning can be turned on
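
The two percentages on the S3 slide are easier to judge as concrete numbers. The sketch below assumes both are annual design targets, which is how AWS states them, and uses exact fractions to avoid floating-point noise. The object count in the example is arbitrary.

```python
from fractions import Fraction

DURABILITY = Fraction("0.99999999999")  # 99.999999999% from the slide
AVAILABILITY = Fraction("0.9999")       # 99.99% from the slide
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_annual_losses(object_count):
    """Expected number of objects lost per year at the stated durability."""
    return object_count * (1 - DURABILITY)


def max_annual_downtime_minutes():
    """Minutes of unavailability per year allowed by the stated availability."""
    return MINUTES_PER_YEAR * (1 - AVAILABILITY)


if __name__ == "__main__":
    print(expected_annual_losses(10_000_000))    # a Fraction of one object
    print(float(max_annual_downtime_minutes()))  # minutes per year
```

With 10,000,000 stored objects, the expected loss works out to 1/10,000 of an object per year, or roughly one object every 10,000 years. The availability figure allows a little under an hour of downtime per year.
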
  21. RDS
      • Create a cloud-based RDBMS
        – Amazon Aurora
        – MySQL
        – MariaDB
        – PostgreSQL
        – Oracle
        – SQL Server
      • Costs based on type of engine, size of database, storage
      Hadoop-stack analog shown: Hive
  22. DynamoDB
      • NoSQL Document Store
      • Handles sparse data
      • Pay for Read/Write Capacity and Storage
      Hadoop-stack analog shown: HBase
  23. Analyzing Data
      Diagram labels (AWS): API Client, Flume Client, Kafka Client, API Gateway, AWS IoT, Lambda, Kinesis, SQS, S3, RDS, DynamoDB, Athena, Redshift, Machine Learning
  24. Athena
      Athena has a limited set of formats that it works with:
      • Apache Web Logs
      • CSV
      • TSV
      • Text File with Custom Delimiters
      • JSON
      • Parquet
      • ORC
      Advantages
      • Serverless
      • Scalable
      Hadoop-stack analog shown: Hive
  25. Redshift
      • PostgreSQL-compatible syntax with columnar storage
      • Designed for DWH/OLAP queries
      • Integrates with DynamoDB, S3, and Data Pipeline
      • Tunable concurrency limits
      Hadoop-stack analog shown: Spark SQL
  26. Machine Learning
      • Offers three types of machine learning models:
        – Binary Classification
        – Multiclass Classification
        – Regression
      • Offers batch or synchronous modes
      Hadoop-stack analogs shown: Mahout, Spark MLlib
  27. Search
      Diagram labels (AWS): API Client, Flume Client, Kafka Client, API Gateway, AWS IoT, Lambda, Kinesis, SQS, S3, RDS, DynamoDB, Athena, Redshift, Machine Learning, Elasticsearch
  28. Elasticsearch
      • Amazon’s implementation of Elastic’s Elasticsearch product
      • Distributed JSON-based search and analytics engine
      • Designed for horizontal scalability, reliability, and easy management
      • Combined with Logstash and Kibana to form the ELK stack
      Hadoop-stack analog shown: SOLR
  29. Other concerns
      • Scalability
      • Fault-tolerance
      • Security
      • Cost
  30. Scalability
      • Adding more resources to AWS clusters can be done at the click of a button.
      • Most AWS services allow for additional resources to be added. Some allow for autoscaling.
      • Autoscaling can be used to limit the cost of cluster operation.
  31. Fault-Tolerance
      • AWS services are designed to be self-healing.
      • The underlying data store for most applications is S3.
  32. Security
      • Security for all of your cluster resources is managed by IAM (Identity and Access Management).
      • Policies can be set for each resource with fine-grained access control.
      • Arguably, this is one area where having a skilled administrator can be a great help.
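
As a small illustration of the fine-grained policies mentioned on the Security slide, the sketch below builds an IAM policy document that grants read-only access to the objects in a single S3 bucket. The `Version`, `Statement`, `Effect`, `Action`, and `Resource` keys are the standard policy elements. The bucket name `fridge-telemetry` is hypothetical, and a real deployment would likely need more statements than this.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build an IAM policy document allowing object reads from one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Applies to every object in the bucket, not the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    # "fridge-telemetry" is a made-up bucket name for illustration.
    print(json.dumps(read_only_bucket_policy("fridge-telemetry"), indent=2))
```
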
  33. Cost
      • You can perform cost calculations before using any services
      • You only pay for what you use (no contracts!)
      • But you will get a better price if you use Reserved Instances (annual or multi-year contracts)
      • You can easily tie infrastructure costs to a product or department
      • There is a free tier that can be used for a year
      • You just need a credit card to get started
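
The trade-off between pay-as-you-go and Reserved Instances comes down to a break-even calculation. The sketch below works in whole cents to keep the arithmetic exact. The two prices in the example (10 cents per hour on demand, 52,500 cents per year reserved) are made-up numbers for illustration, not AWS pricing.

```python
HOURS_PER_YEAR = 365 * 24


def break_even_hours(on_demand_cents_per_hour, reserved_cents_per_year):
    """Hours of use per year above which the reserved price is cheaper."""
    return reserved_cents_per_year / on_demand_cents_per_hour


def cheaper_plan(hours_used, on_demand_cents_per_hour, reserved_cents_per_year):
    """Name the cheaper option for a given number of hours used per year."""
    on_demand_total = hours_used * on_demand_cents_per_hour
    if reserved_cents_per_year < on_demand_total:
        return "reserved"
    return "on-demand"


if __name__ == "__main__":
    # Hypothetical prices, in cents.
    hours = break_even_hours(10, 52_500)
    print(hours, hours / HOURS_PER_YEAR)
    print(cheaper_plan(HOURS_PER_YEAR, 10, 52_500))
    print(cheaper_plan(2_000, 10, 52_500))
```

With these example prices, a server that runs around the clock is cheaper reserved, while one that is switched off most of the year is cheaper on demand.
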
  34. And more
  35. Join Our Team. Contact: Adam.doyle@daugherty.com
