
(BDT210) Building Scalable Big Data Solutions: Intel & AOL


"Growing data is a massive computational challenge across the enterprise. The opportunity to draw insights from huge data sets is wide open, but traditional computing environments often can't scale to those volumes. In this session, Intel Chief Data Scientist Bob Rogers, PhD, explains how developers can take advantage of technologies from Intel with the AWS platform.

Also in this session, AOL Systems Architect Durga Nemani provides insights into how AOL was able to reduce the time and cost to process massive amounts of clickstream data by leveraging big data technologies in AWS. AOL can process data as fast as possible or as cheaply as possible, depending on the SLA, by choosing the number and types of instances without any changes to the code. Session sponsored by Intel."

Published in: Technology


  1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Bob Rogers, PhD, Chief Data Scientist for Big Data Solutions, Intel; Durga Nemani, System Architect, AOL Inc. October 2015. Building Scalable Big Data Solutions. BDT210
  2. Building Scalable Big Data Solutions. October 2015. Bob Rogers, PhD, Chief Data Scientist for Big Data Solutions, Intel
  3. @scientistBob About me
  4. @scientistBob What does Big Data have to do with Intel? Trusted Analytics Platform
  5. @scientistBob Intel contributions to Apache Hadoop: encryption, Intel® AES-NI
  6. @scientistBob Use case: Assemble an accurate patient problem list. Why? • To improve patient outcomes. KPI: • False negatives in the problem list
  7. @scientistBob What does a patient look like to a data scientist?
  8. @scientistBob My first enterprise data hub
  9. @scientistBob Poll: What percent of the key clinical data do you think is missing from the problem list? • 0-25% • 25-50% • 50-75% • 75-100%
  10. @scientistBob Poll answer: >63% missing
  11. @scientistBob Real patient example: Coded Data, Free Text, Scanned Documents, Other Data Silos
  12. @scientistBob Missing information
  13. @scientistBob What did we learn? • Start with what you know • Leverage existing technologies • Use simple tools • Measure your results
  14. @scientistBob Powerful Big Data analytics reveal the truth about your… …customers …products …ecosystem …opportunities
  15. Thank you. bob.rogers@intel.com @scientistBob
  16. Building Scalable Big Data Solutions. Durga Nemani, AOL Inc.
  17. BACKGROUND & ARCHITECTURE
  18. HYBRID
  19. The Three Vs • Volume: multiple terabytes per day • Variety: delimited, Avro, JSON • Velocity: hourly, batch
  20. Workload Management • A "one size fits all" model does not work • Specific infrastructure tuned to needs and requirements • A variety of EMR clusters matched to each data need. (Diagram: workloads with significant diversity of needs served by lowest-common-denominator resources vs. resources sized to each workload.)
  21. (Diagram: a single S3 store feeding multiple independent EMR clusters.)
  22. Open source data formats: JSON, Avro, Parquet • AWS services: EC2, EMR, S3 • Open source technologies: Apache Hadoop, Apache Pig, Apache Hive
  23. UNIQUE FEATURES & ADVANTAGES
  24. Separation of Compute and Storage
  25. SEE, SPOT, SQUEEZE • Just enough Spot instances to finish the job in 59 minutes.
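Slide 25's rule of thumb worked because EC2/EMR capacity was billed by the full instance-hour at the time: a job squeezed into 59 minutes pays for exactly one hour per instance. A minimal sketch of that sizing arithmetic (the function names and the even-parallelism assumption are ours, not from the deck):

```python
import math

def instances_needed(total_task_minutes: float, target_minutes: float = 59) -> int:
    """Minimum number of identical workers to finish all the work within the
    target wall-clock time, assuming the work parallelizes evenly."""
    return math.ceil(total_task_minutes / target_minutes)

def billed_instance_hours(instances: int, runtime_minutes: float) -> int:
    """Per-hour billing rounds each instance's runtime up to a full hour."""
    return instances * math.ceil(runtime_minutes / 60)

# A job with 600 instance-minutes of work: 11 Spot instances finish it in
# under 59 minutes and are billed for exactly 11 instance-hours.
n = instances_needed(600)              # -> 11
hours = billed_instance_hours(n, 59)   # -> 11
```

Letting the same job spill to 61 minutes would double the bill to 22 instance-hours, which is the whole point of "squeeze".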
  26. Key Features • Separation of compute and storage: Amazon S3 and Amazon EMR • Transient clusters: no permanent cluster; different-sized clusters for different datasets • Separation of duties: independent jobs for processing, extracting, loading, and monitoring • Parallelism: process the smallest possible chunk of data in parallel to reduce dependencies • Scalability: hundreds of Amazon EMR clusters in multiple regions and Availability Zones • Cost optimized: all Spot instances; launch in the Availability Zone with the lowest Spot prices
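The transient, Spot-only cluster pattern from slide 26 can be sketched as the request body that boto3's EMR `run_job_flow` call accepts. The bucket names, instance types, bid price, and sizes below are illustrative placeholders, not AOL's actual configuration:

```python
def transient_cluster_request(name: str, core_nodes: int, step_jar: str) -> dict:
    """Build a run_job_flow-shaped request for a transient, Spot-only cluster."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.1.0",
        "LogUri": "s3://example-logs-bucket/emr/",  # separate logs bucket
        "JobFlowRole": "EMR_EC2_DefaultRole",       # IAM roles; keyless cluster
        "ServiceRole": "EMR_DefaultRole",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceCount": 1,
                 "InstanceType": "m3.xlarge", "Market": "SPOT", "BidPrice": "0.10"},
                {"InstanceRole": "CORE", "InstanceCount": core_nodes,
                 "InstanceType": "m3.xlarge", "Market": "SPOT", "BidPrice": "0.10"},
            ],
            # Transient: tear the cluster down as soon as the steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-one-dataset",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": step_jar, "Args": []},
        }],
    }

request = transient_cluster_request("clickstream-hourly", core_nodes=20,
                                    step_jar="s3://example-code-bucket/job.jar")
```

With `KeepJobFlowAliveWhenNoSteps` false, each dataset gets its own right-sized, short-lived cluster; varying `core_nodes` and the instance type trades cost against speed with no code changes, as the abstract describes.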
  27. DATA & INSIGHTS
  28. CLOUD Facts • Total compressed Amazon S3 data size: 150 TB • Uncompressed raw data/day: 2-3 TB • Amazon EMR clusters/day: 350 • Amazon S3 data retention period: 13-24 months
  29. Restatement use case: 150 terabytes of raw data, 550 EMR clusters, 24,000 EC2 instances, 10 Availability Zones
  30. AWS COST BREAKOUT • EC2 cost: 44% • EMR fee: 40% • S3 cost: 16% • ** Storage cost recurs every month at $2.85/100 GB
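As a back-of-envelope check on slide 30's footnote, the recurring bill for the 150 TB sitting in S3 at $2.85 per 100 GB comes to roughly $4,400 a month (whether a TB is counted as 1000 or 1024 GB shifts the figure slightly; we use 1024 here):

```python
def monthly_s3_cost(total_tb: float, dollars_per_100gb: float = 2.85) -> float:
    """Recurring monthly S3 storage cost at a flat per-100-GB rate."""
    gb = total_tb * 1024  # treating 1 TB as 1024 GB; 1000 is also defensible
    return gb / 100 * dollars_per_100gb

# 150 TB compressed in S3 -> about $4,378 every month at $2.85/100 GB.
cost = monthly_s3_cost(150)
```

That recurring charge is why storage still accounts for 16% of the breakout even though compute dominates.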
  31. Best Practices & Suggestions
  32. • Tag all resources • Infrastructure as code • Command Line Interface • JSON as configuration files • AWS Identity and Access Management (IAM) roles and policies • Use of application ID • Enable CloudTrail • S3 lifecycle management • S3 versioning • Separate code/data/logs buckets • Keyless EMR clusters • Hybrid model • Enable debugging • Create multiple CLI profiles • Multi-factor authentication • CloudWatch billing alarms • EC2 Spot instances • SNS notifications for failures • Loosely coupled apps • Scale horizontally
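One of the listed practices, S3 lifecycle management, pairs naturally with the 13-24 month retention period quoted earlier. A sketch of a lifecycle configuration in the shape boto3's `put_bucket_lifecycle_configuration` accepts; the prefix and day counts are illustrative assumptions, not AOL's policy:

```python
# Hypothetical lifecycle rule for a clickstream data bucket: tier colder
# objects to Glacier after ~3 months, delete them after ~24 months.
LIFECYCLE = {
    "Rules": [{
        "ID": "expire-clickstream",
        "Prefix": "clickstream/",
        "Status": "Enabled",
        # Move aging data to cheaper storage...
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        # ...and expire it once it passes the retention window (~24 months).
        "Expiration": {"Days": 730},
    }]
}
```

Applied to the data bucket (kept separate from code and logs buckets, per the same slide), this keeps the recurring storage bill bounded without any cleanup jobs.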
  33. Next Steps
  34. Database on cloud • Database on AWS • Options: Amazon RDS, Amazon Redshift, or others on Amazon EC2 • Event-driven design • Kick off code based on events • Run downstream processes as soon as upstream completes • Options: AWS Lambda, Amazon SQS, Amazon SWF, or AWS Data Pipeline • Data analytics • Implement massively parallel processing technologies • Options: Spark, Impala, or Presto • DevOps on cloud • Rapidly and automatically deploy new code • Continuous Integration/Continuous Deployment • Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline
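The event-driven design in the next steps might start with something as small as a Lambda handler that reads the S3 put notification and hands the new object to the downstream job. A minimal sketch (the bucket and key names in the usage example are invented for illustration):

```python
def handler(event, context=None):
    """Extract (bucket, key) pairs from an S3 notification event so a
    downstream process can start as soon as the upstream data lands."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    # In a real deployment this would trigger the next stage, e.g. submit
    # an EMR step or publish the keys to SQS/SNS.
    return objects

sample_event = {"Records": [{"s3": {
    "bucket": {"name": "example-data-bucket"},
    "object": {"key": "clickstream/2015/10/08/part-0000.avro"},
}}]}
found = handler(sample_event)
```

This replaces hourly polling with per-object triggering, which is exactly the "run downstream as soon as upstream completes" goal above.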
  35. 35. Q&A
  36. THANK YOU • Recommended session: BDT208 - A Technical Introduction to Amazon Elastic MapReduce • Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B
  37. 37. Remember to complete your evaluations!
