Data & Analytics - Session 1 - Big Data Analytics

Learn more about the tools, techniques and technologies for working productively with data at any scale. This presentation introduces the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Jon Einkauf, Senior Product Manager, Elastic MapReduce, AWS
Alan Priestley, Marketing Manager, Intel
Bob Harris, CTO, Channel 4

Published in: Technology

Data & Analytics - Session 1 - Big Data Analytics

  1. 1. Big Data Analytics. Jon Einkauf, Senior Product Manager, Amazon Elastic MapReduce
  2. 2. Overview: 1. Introducing Big Data; 2. From data to actionable information; 3. Analytics and Cloud Computing
  3. 3. Section 1: Introducing Big Data
  4. 4. Generation; Collection & storage; Analytics & computation; Collaboration & sharing
  5. 5. The cost of data generation is falling
  6. 6. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Lower cost, higher throughput
  7. 7. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Lower cost, higher throughput. Highly constrained
  8. 8. Generated data; Available for analysis; Data volume. Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  9. 9. Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints
  10. 10. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Lower cost, higher throughput. Highly constrained
  11. 11. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Accelerated
  12. 12. Big Data: Technologies and techniques for working productively with data, at any scale.
  13. 13. Section 2: From data to actionable information
  14. 14. “Who buys video games?”
  15. 15. Per day: 3.5 billion records; 13 TB of click stream logs; 71 million unique cookies
  16. 16. Results: 500% return on ad spend; 17,000% reduction in procurement time
  17. 17. “Who is using our service?”
  18. 18. Identified early mobile usage. Invested heavily in mobile development. Finding signal in the noise of logs
  19. 19. In January 2013: 9,432,061 unique mobile devices used the Yelp mobile app. 4 million+ calls. 5 million+ directions.
  20. 20. Open web index. 3.4 billion records. Available to all.
  21. 21. Full parse for impact of social networks. 300 lines of Ruby code. 14 hours. $100.
  22. 22. Tweeting about Flu. You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011
  23. 23. Section 3: Analytics and Cloud Computing
  24. 24. Generation; Collection & storage; Analytics & computation; Collaboration & sharing
  25. 25. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Services: S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase
  26. 26. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Services: EC2 & Elastic MapReduce
  27. 27. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Services: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift
  28. 28. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Services: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift; EC2 & Elastic MapReduce; S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase; AWS Data Pipeline
  29. 29. Elastic MapReduce
  30. 30. How does it work? [EMR, EMR Cluster, S3] 1. Put the data into S3 (or HDFS). 2. Launch your cluster. Choose: • Hadoop distribution • How many nodes • Node type (hi-CPU, hi-memory, etc.) • Hadoop apps (Hive, Pig, HBase). 3. Get the results
  31. 31. How does it work? [EMR, EMR Cluster, S3] You can easily resize the cluster
  32. 32. How does it work? [EMR, EMR Cluster, S3] Use Spot nodes to save time and money
  33. 33. How does it work? [EMR, EMR Cluster, S3] Launch parallel clusters against the same data source (tune for the workload)
  34. 34. How does it work? [EMR Cluster, S3] When the work is complete, you can terminate the cluster (and stop paying)
  35. 35. How does it work? [EMR Cluster] You can store everything in HDFS (local disk). High Storage nodes = 48 TB/node
  36. 36. How does it work? [EMR Cluster] Launch in a Virtual Private Cloud for extra security
  37. 37. Thousands of Customers, 5+ Million Clusters
  38. 38. Give it a try. Cost to run a 100-node EMR cluster: £4.90 / hour
  39. 39. AWS Data Pipeline: Data-intensive orchestration and automation. Reliable and scheduled. Easy to use, drag and drop. Execution and retry logic. Map data dependencies. Create and manage temporary compute resources
  40. 40. Anatomy of a pipeline
  41. 41. Additional checks and notifications
  42. 42. Arbitrarily complex pipelines
  43. 43. Thanks. jeinkauf@amazon.com. To Learn More: aws.amazon.com/elasticmapreduce, aws.amazon.com/datapipeline, aws.amazon.com/big-data
  44. 44. Back to the Future: Big Data at Channel 4. Bob Harris, Chief Technology Officer – Channel 4 Television. April 2013
  45. 45. The Disclaimer: <IMHO>blah blah blah…..</IMHO>
  46. 46. C4 in the Cloud • 2008 – Started investigations into Cloud Computing • 2008 – Launched our first applications on AWS • 2009 – Entered into an Enterprise Agreement with Amazon for AWS. Rapid growth of AWS based offerings during 2009/2010 • 2011 – AWS established as the default platform of choice for new websites
  47. 47. C4 in the Cloud
  48. 48. C4 in the Cloud • 2008 – Started investigations into Cloud Computing • 2008 – Launched our first applications on AWS • 2009 – Entered into an Enterprise Agreement with Amazon for AWS. Rapid growth of AWS based offerings during 2009/2010 • 2011 – AWS established as the default platform of choice for new websites • 2012 – Adopted cloud-based analytics • 2013 – Investigating cloud-based back-up and archiving
  49. 49. Why Big Data?
  50. 50. Business Intelligence at C4 • Well established Business Intelligence capability • Based on industry standard proprietary products • Real-time data warehousing • Comprehensive business reporting • Excellent internal skills • Good external skills availability
  51. 51. Big Data at C4. 2011: • Embarked on Big Data initiative in 2011 • Ran in-house and cloud-based PoCs • Selected AWS Elastic Map Reduce. 2012: • Ran EMR in parallel with conventional BI stack • Hive deployed to Data Analysts in 2012 • EMR workflows deployed to production in 2012. 2013: • EMR confirmed as primary Big Data platform • EMR usage growing, focus on automation • Experimenting with R and Mahout
  52. 52. Big Data at C4 – Elastic MapReduce • AWS EMR established as our Big Data platform of choice • Friendly front-end developed to allow Data Analysts to start/stop clusters and submit/track queries.
  53. 53. Big Data at C4 – Big Data Control Panel
  54. 54. Big Data at C4 – Elastic MapReduce • AWS EMR established as our Big Data platform of choice • Friendly front-end developed to allow Data Analysts to start/stop clusters and submit/track queries. • Production workflows written predominantly in Python and Pig • Fully integrated with our conventional BI stack making EMR outputs available for reporting • Experimenting with ADP (AWS Data Pipeline) • Next steps – MapR and HBase
  55. 55. Big Data – Improving Viewer Experience. Personalising the viewer experience: Most popular dramas; Drama collections; US drama. Single view of the viewer, recognising them across devices and serving relevant content
  56. 56. Myths or Truths? – It’s all about Perspective! • Nothing that can’t be done with an RDBMS • It’s a completely different approach • It’s really difficult • It’s immature and lacks good tools • It’s totally incompatible with your current BI platform and tools • It’s difficult to find skilled and experienced staff. Elastic MapReduce has provided a cost effective approach to establishing our Big Data platform. Image by Tayrawr Fortune
  57. 57. That’s all folks… bharris@channel4.co.uk, @bobharrisuk, uk.linkedin.com/in/bobharrisuk01
  58. 58. Alan Priestley, EMEA Enterprise Marketing, Intel Corporation
  59. 59. Analysis of Data Can Transform Society. Create new business models and improve organizational processes. Enhance scientific understanding, drive innovation, and accelerate medical cures. Increase public safety and improve energy efficiency with smart grids.
  60. 60. Democratizing Analytics gets Value out of Big Data: Unlock Value in Silicon; Support Open Platforms; Deliver Software Value
  61. 61. Intel at the Intersection of Big Data. HPC: Enabling exascale computing on massive data sets. Cloud: Helping enterprises build open interoperable clouds. Open Source: Contributing code and fostering ecosystem
  62. 62. Intel at the Heart of the Cloud: Server; Storage; Network
  63. 63. Scale-Out Platform Optimizations for Big Data. Cost-effective performance • Intel® Advanced Vector Extension Technology • Intel® Turbo Boost Technology 2.0 • Intel® Advanced Encryption Standard New Instructions Technology
  64. 64. Intel® Advanced Vector Extensions Technology • Newest in a long line of processor instruction innovations • Increases floating point operations per clock, up to 2X performance [1]. [1] Performance comparison using Linpack benchmark. See backup for configuration details. For more legal information on performance forecasts go to http://www.intel.com/performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
  65. 65. Intel® Turbo Boost Technology 2.0: More Performance. Higher turbo speeds maximize performance for single and multi-threaded applications
  66. 66. Intel® Advanced Encryption Standard New Instructions • Processor assistance for performing AES encryption; 7 new instructions • Makes enabled encryption software faster and stronger
  67. 67. Power of the Platform built by Intel. Richer user experiences. [Figure: TeraSort for 1TB sort; time labels: 4 HRS, 10 MIN; reduction labels: 50%, 80%, 50%, 40%; component labels: Previous Intel® Xeon® Processor, Intel® Xeon® Processor E5 2600, Solid-State Drive, 10G Ethernet, Intel® Apache Hadoop]
  68. 68. Virtuous Cycle of Data-Driven Experience: Cloud; Intelligent Systems; Clients
  69. 69. Get 600 Hours of free supercomputing time! www.powerof60.com
  70. 70. Thank you!
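
The cluster choices walked through in slides 30 to 36 (Hadoop distribution, node count, node type, Hadoop apps, Spot nodes, launching in a VPC, terminating when the work is done) can be sketched as a single request description. This is a minimal illustration and not taken from the talk: it assumes the request shape accepted by boto3's `run_job_flow` call for EMR, and the instance type, release label, role names, bucket and subnet values are placeholder assumptions. The function only builds a dictionary and does not contact AWS.

```python
from typing import Any, Dict, List, Optional


def build_emr_request(
    name: str,
    log_uri: str,
    core_nodes: int,
    spot_task_nodes: int = 0,
    instance_type: str = "m5.xlarge",    # assumed node type, not from the slides
    release_label: str = "emr-6.15.0",   # assumed distribution/release, not from the slides
    apps: Optional[List[str]] = None,
    subnet_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Describe an EMR cluster as a plain dict (hypothetical helper)."""

    def group(label: str, role: str, market: str, count: int) -> Dict[str, Any]:
        # One instance group: a set of identical nodes sharing a role.
        return {
            "Name": label,
            "InstanceRole": role,
            "Market": market,
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    # "How many nodes" and "node type" (slide 30).
    groups = [
        group("master", "MASTER", "ON_DEMAND", 1),
        group("core", "CORE", "ON_DEMAND", core_nodes),
    ]
    # "Use Spot nodes to save time and money" (slide 32).
    if spot_task_nodes > 0:
        groups.append(group("task-spot", "TASK", "SPOT", spot_task_nodes))

    instances: Dict[str, Any] = {
        "InstanceGroups": groups,
        # Terminate once the steps finish, so billing stops (slide 34).
        "KeepJobFlowAliveWhenNoSteps": False,
    }
    # "Launch in a Virtual Private Cloud" (slide 36).
    if subnet_id is not None:
        instances["Ec2SubnetId"] = subnet_id

    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        # "Hadoop apps (Hive, Pig, HBase)" (slide 30).
        "Applications": [{"Name": app} for app in (apps or ["Hive", "Pig", "HBase"])],
        "Instances": instances,
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_emr_request(
        "demo-cluster",
        "s3://example-bucket/logs/",  # hypothetical bucket
        core_nodes=4,
        spot_task_nodes=2,
    )
    for g in request["Instances"]["InstanceGroups"]:
        print(g["InstanceRole"], g["Market"], g["InstanceCount"])
```

With boto3 installed and AWS credentials configured, a dictionary like this could in principle be passed as `boto3.client("emr").run_job_flow(**request)`. Check the current EMR documentation before relying on any of the placeholder values.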
