AWS Sydney Summit 2013 - Big Data Analytics


Session 3, Presentation 1 from the AWS Sydney Summit

Published in: Technology

  • The key messages to deliver with this slide are: 1. Elastic MapReduce is a hosted Hadoop service. We use the most stable version of Apache Hadoop, provide it as a hosted service, and build integration points with other services in the AWS ecosystem such as S3, CloudWatch and DynamoDB. We make other improvements to Hadoop so that it becomes easier to scale and manage on AWS. 2. We will keep iterating on the different versions of Hadoop as they become stable. When you use the console you launch the latest version of Hadoop, but you also have the choice of launching an older version via the CLI or the SDK. 3. So what can you do with EMR? You can build applications on Amazon EMR, just as you would with Hadoop. To develop custom Hadoop applications, you used to need access to a lot of hardware to test your Hadoop programs. Amazon EMR makes it easy to spin up a set of Amazon EC2 instances as virtual servers to run your Hadoop cluster. You can also test various server configurations without having to purchase or reconfigure hardware. When you're done developing and testing your application, you can terminate your cluster, paying only for the computational time you used. Amazon EMR provides three types of clusters (also called job flows) that you can launch to run custom map-reduce applications, depending on the type of program you're developing and which libraries you intend to use.
  • EMR supports multiple instance types, including the latest HS1 instance type. EMR now supports High Storage Instances (hs1.8xlarge) in US East. These instances offer 48 TB of storage across 24 hard disk drives, 35 EC2 Compute Units (ECUs) of compute capacity, 117 GB of RAM, 10 Gbps networking, and 2.4+ GB per second of sequential I/O performance. High Storage Instances are ideally suited for Hadoop, and they significantly reduce the cost of processing very large data sets on EMR. We look forward to adding support for High Storage Instances in additional regions early next year.
  • 10 x 10 = 100 nodes running for 1 hour
  • The concept of adding nodes works well with Hadoop, especially in the cloud, since 10 nodes running for 10 hours cost the same as 100 nodes running for 1 hour.
  • Speaker Notes: Often the question about Big Data is, “What can it do for me?” And that’s a very important question, because without the value proposition, Big Data would just be an exercise. But I’m here to tell you that Big Data services, provided by AWS and supported by Intel, are a game changer. For example: yes, Big Data offers insights into how we conduct business. But it also enables scientific discovery, opens up the possibility to treat and cure diseases, and enhances our communities with intelligent power grids and highways. These are just a handful of ideas. The frontier of Big Data is so much more. The technology provided means no limits to how you use the information. People are innovating new uses for Big Data every day.
  • Speaker Notes: Intel’s vision of Big Data is more than just the possibility for streamlined business. We see entire cities and communities connected, using the data we generate in every aspect, business and personal, to inform us and enable us to make better decisions about our lives. And all of this is made possible by the innovations developed in partnership between Intel and Amazon Web Services: a Big Data infrastructure vast enough to handle the data we produce, and cost-effective enough for us to use. Big Data really is about the future, a future of challenges and great opportunities that AWS and Intel are ready and eager to tackle.
  • Speaker Notes: As you can see, Intel is at the intersection of enabling Big Data: - Exascale-level High Performance Computing and cloud environments based on Intel® Xeon® processors. - Plus, Intel is encouraging the growth of the open source ecosystem to foster innovation among developers, and keep cloud services, like AWS, affordable for all.
  • Speaker Notes: And to be at that intersection, to keep the proverbial traffic of Big Data flowing smoothly, we’ve built the technological backbone for Big Data. The challenges to scale and the capabilities we’ve built into the Intel® Xeon® processor are needed across the entire data center: servers, storage devices and network solutions. It should be noted that Intel is #1 in Servers, Storage and Networks. - These industry-standard, modular building blocks allow efficient and cost-effective scaling of compute, storage and network systems to match user needs. - Traditionally storage devices used lower performance, proprietary ASICs, but today the demand for performance has increased to tackle challenges like data de-duplication and improved archiving. This, in addition to distributed file systems for cloud-based storage and a desire for improved analytics, drives a need for more processing power… and vendors are increasingly turning to Intel® Xeon® processors. Plus, the improvements that Intel offers in our latest processors can benefit every aspect of what your infrastructure does. And these building blocks are what make amazing software like Hadoop work.
  • Speaker Notes: Key points: The Intel® Xeon® Processor E5 Family provides: cost-effective performance; Intel® Advanced Vector Extension Technology; Intel® Turbo Boost Technology 2.0; Intel® Advanced Encryption Standard New Instructions Technology. Significant performance gains are delivered by features such as the new Intel® Advanced Vector Extensions and improved Intel® Turbo Boost Technology 2.0, providing performance when you need it. Dramatically reduce compute time with Intel® Advanced Vector Extensions. Accelerate floating point calculation for scientific simulations and financial analytics. Performance when you need it with Intel® Turbo Boost Technology 2.0. Up to 80% performance boost vs. prior generation. To improve flexibility and operational efficiency, there are significant improvements in I/O with the new Intel® Integrated I/O, which reduces latency by ~30% while adding more lanes and higher bandwidth with support for PCI Express 3.0. Cost-effective performance for standardizing scale-out nodes for Hadoop. Intel® AES-NI to accelerate security encryption workloads. Optimized core-to-memory footprint ratios. Top memory channels and frequency for shared-nothing scaling. Story: To meet the growing demands of IT, such as readiness for cloud computing, the growth in users and the ability to tackle the most complex technical problems, Intel has focused on increasing the capabilities of the processor that lies at the heart of a next generation data center. The Intel® Xeon® processor E5-2600 product family is the next generation Xeon® processor that replaces platforms based on the Intel® Xeon® processor 5600 and 5500 series. Continuing to build on the success of the Intel® Xeon® 5600, the E5-2600 product family has increased core count and cache size, in addition to supporting more efficient instructions with Intel® Advanced Vector Extensions, to deliver up to an average of 80% more performance across a range of workloads. These processors will offer better than ever performance no matter what your constraint is (floor space, power or budget) and on workloads that range from the most complicated scientific exploration to simple, yet crucial, web serving and infrastructure applications. In addition to the raw performance gains, we’ve invested in improved I/O with Intel Integrated I/O, which reduces latency by ~30% while adding more lanes and higher bandwidth with support for PCIe 3.0. This helps to reduce network and storage bottlenecks to unleash the performance capabilities of the latest Xeon processor. The Intel® Xeon® processor E5-2600 product family: versatile processors at the heart of today’s data center.
  • Key points: Intel® Advanced Vector Extensions Technology is a collection of CPU instructions that increase floating point performance by doubling the length of the FP registers to 256 bits and reducing the number of operations required to execute large FP tasks. Applications include: Science/Engineering, Data Mining, Visual Processing, HPC. Story: Another avenue that Intel has taken to add more flexible performance is to add instructions that make the processor do more work every clock cycle. Intel® Advanced Vector Extensions can offer up to double the floating point operations per clock cycle by doubling the length of registers. This is used when you need to address very complex problems or deal with large-number calculations, which are integral to many technical, financial and scientific computing problems. Workloads that can see improvements from AVX range from manufacturing optimizations, to the analysis of competing options, to content creation and engineering simulations. Intel® AVX is the newest in a long line of instruction innovations going back to the mid 90’s with MMX and SSE1, which are all now standard software practices. Intel AVX is supported by Intel and 3rd party compilers that take advantage of the latest instructions to optimize code and significantly reduce compute time, enabling faster time to results. With the Xeon processor E5-2600 family you can be confident that you’ll benefit from those optimizations as new applications are introduced and updates to existing software packages are released. Legal Info: (AVX Performance) Source: Performance comparison using Linpack benchmark. Baseline score of 159.4 based on Intel internal measurements as of 5 December 2011 using a Supermicro* X8DTN+ system with two Intel® Xeon® processor X5690, Turbo Enabled, EIST Enabled, Hyper-Threading Enabled, 48 GB RAM, Red Hat* Enterprise Linux Server 6.1. New score of 347.7 based on Intel internal measurements as of 5 December 2011 using an Intel® Rose City platform with two Intel® Xeon® processor E5-2690, Turbo Enabled or Disabled, EIST Enabled, Hyper-Threading Enabled, 64 GB RAM, Red Hat* Enterprise Linux Server 6.1. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
  • Key points: Get more computing power when you need it, with performance that adapts to spikes in your workload, with Intel® Turbo Boost Technology 2.0. New Intel® Turbo Boost Technology 2.0 delivers up to 2x more performance upside than previous generation turbo technology. Story: Beyond simply making the processor more capable with more cores, cache and memory, we’ve also focused on making the processor more adaptive and intelligent. Starting with the Intel® Xeon® processor 5500 series (formerly codenamed Nehalem-EP) we introduced a feature called Intel Turbo Boost Technology, which allowed the processor to increase frequency at the OS’s request to handle workload spikes, as well as shift power across the processor, so if you had one core working hard and one core idle the processor could “turbo up” by redirecting power from the idle core to the active one. With the Xeon processor E5-2600 product family we are able to refine this technology to enable even higher turbo speeds: for example, the top Xeon processor 5690 with only 1 core active could turbo up ~266 MHz, while the top Xeon processor E5-2690 can increase frequency by 900 MHz. This greater ability to turbo up is due to improved power and thermal management data across the platform: the processor keeps track of how hard it has been running and will modulate how far it will push itself in a turbo situation to provide the maximum frequency while meeting Intel’s stringent reliability standards. In addition, we’ve improved the turbo algorithm to assess whether the core speed is the limiter or whether the processor is waiting for data from memory or I/O before it commits power to the burst of speed. The goal of turbo is to get workload spikes dealt with as quickly as possible to get back to a lower power state, which reduces average power draw and cost of operation. Legal Info: Source: Performance comparison using SPECint*_rate_base2006 benchmark with turbo enabled and disabled. Estimated scores of 393 (turbo enabled) and 376 (turbo disabled) based on Intel internal estimates as of 6 March 2012 using a Supermicro* X8DTN+ system with two Intel® Xeon® processor X5690, Turbo Enabled (or Disabled), EIST Enabled, Hyper-Threading Enabled, 48 GB RAM, Intel® Compiler 12.0, Red Hat* Enterprise Linux Server 6.1 for x86_6. Estimated scores of 659 (turbo enabled) and 594 (turbo disabled) based on Intel internal estimates using an Intel® Rose City platform with two Intel® Xeon® processor E5-2680, Turbo Enabled (or Disabled), EIST Enabled, Hyper-Threading Enabled, 64 GB RAM, Intel® Compiler 12.1, Red Hat* Enterprise Linux Server 6.1 for x86_6.
  • Intel AES-NI: What is it? Key point: Data encryption shows a 10x speedup¹ in AES encryption. Intel AES-NI is a set of new instructions for enhancing the performance of cryptography using the widely accepted Advanced Encryption Standard (AES) algorithm. There are 7 new instructions in the processor that target some of the more complex and compute-expensive encryption, decryption, key expansion and multiplication steps (and there are multiple steps in every instance of working with encrypted data), increasing the performance and efficiency of these operations. But note that the instructions do not implement the entire AES algorithm in silicon; only the most processor-intensive elements have been targeted. This provides more flexibility and balance between HW performance and SW extensibility. Another benefit of the new instructions is that they actually help protect the data better as well. The use of the more efficient steps enabled in AES-NI makes “side channel” snooping attacks more difficult. These attacks use SW agents to analyze how a system processes data, searching for cache and memory access patterns or other system data to help deduce elements of the cryptographic processing, and therefore make it easier to “crack”. AES-NI helps hide critical elements such as table lookups, making it harder to determine what elements of crypto processing are happening. Taking down the performance tax frees IT managers to use encryption more broadly without sacrificing performance.
  • Speaker Notes: So let’s see rubber meet road and look at how the technology enables high performance computing. Right here you’re seeing the Intel-based ecosystem at work. - Start with a 4 hour process time to sort 1 Terabyte of data. - Upgrade the processor to the latest Intel® Xeon® processor to cut compute time in half. - Add an SSD to reduce it by another 80%. - Upgrade to 10 Gigabit Ethernet for additional reductions. The end result is a fraction of the original compute time: 10 minutes to sort 1 Terabyte of data. These datacenter innovations streamline the process and make affordable Big Data analytics possible. As this testing shows, as important as the processor is in improving the customer experience, it’s not the entire solution. By understanding the benefits of SSDs, 10GbE and Intel SW tools we can give an even better experience with Intel optimized platforms, and boost business results.
  • Speaker Notes: If you wanted to see this process of transforming Big Data into action, it would look something like this. - Big Data provides rich, personalized, immersive experiences for clients. - This in turn creates more rich interactions, and generates more data into the cloud. - Which leads to higher volumes of data to analyze through intelligent systems, - Which leads to even more rich, personalized, and immersive experiences. As you can see, the cycle feeds into itself. And this brings users into the fold. We’re not just talking about businesses anymore; we’re looking at how Big Data affects us all on a day-to-day basis.
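
The "up to 2X" claims in the AVX and Turbo Boost notes above can be checked against the benchmark scores quoted in their legal notes. The following is a minimal sketch in Python, not part of the original deck: the function and variable names are illustrative assumptions, and the only inputs are the Linpack and SPECint*_rate_base2006 figures stated in the notes.

```python
# Scores copied from the "Legal Info" notes above (Intel internal figures).
LINPACK_X5690 = 159.4     # baseline: two Xeon X5690 processors
LINPACK_E5_2690 = 347.7   # new: two Xeon E5-2690 processors

# SPECint*_rate_base2006 estimates as (turbo enabled, turbo disabled).
SPECINT_X5690 = (393, 376)
SPECINT_E5_2680 = (659, 594)


def speedup(new_score: float, baseline_score: float) -> float:
    """Ratio of the new score to the baseline score."""
    return new_score / baseline_score


def turbo_upside(enabled: float, disabled: float) -> float:
    """Fractional gain from enabling turbo (0.05 means 5%)."""
    return enabled / disabled - 1.0


# Hypothetical names, used only for this illustration.
linpack_gain = speedup(LINPACK_E5_2690, LINPACK_X5690)
upside_previous = turbo_upside(*SPECINT_X5690)
upside_current = turbo_upside(*SPECINT_E5_2680)
upside_ratio = upside_current / upside_previous

print(f"Linpack speedup: {linpack_gain:.2f}x")
print(f"Turbo upside, X5690: {upside_previous:.1%}")
print(f"Turbo upside, E5-2680: {upside_current:.1%}")
print(f"Upside ratio: {upside_ratio:.1f}x")
```

With the quoted scores, the Linpack ratio comes out at a little over 2x, and the turbo upside grows from roughly 4.5% to roughly 10.9%, which is consistent with the "up to 2x more performance upside" wording.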

    1. Abhishek Sinha. Big Data Analytics. Business Development Manager
    2. Overview • The Big Data Challenge • Big Data tools and what can we do with them? • Packetloop – Big Data Security Analytics • Intel technology on big data.
    3. An engineer’s definition: When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share it
    4. Generation; Collection & storage; Analytics & computation; Collaboration & sharing
    5. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Lower cost, higher throughput
    6. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Lower cost, higher throughput. Highly constrained
    7. Chart labels: Generated data; Available for analysis; Data volume. Sources: Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011. IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
    8. Amazon Web Services helps remove constraints
    9. Remove constraints = More experimentation. More experimentation = More innovation. More innovation = Competitive edge
    10. Elastic MapReduce and Redshift. Big Data tools
    11. EMR is Hadoop in the Cloud
    12. What is Amazon Redshift? Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud. Easy to provision and scale. No upfront costs, pay as you go. High performance at a low price. Open and flexible with support for popular BI tools
    13. Elastic MapReduce and Redshift. Big Data tools
    14. How does EMR work? Put the data into S3. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc. Launch the cluster using the EMR console, CLI, SDK, or APIs. Get the output from S3. You can also store everything in HDFS. (Diagram labels: EMR, EMR Cluster, S3)
    15. What can you run on EMR… (Diagram labels: S3, EMR, EMR Cluster)
    16. Resize Nodes: You can easily add and remove nodes. (Diagram labels: EMR, EMR Cluster, S3)
    17. Resize Nodes with Spot Instances. Cost without Spot: 10 node cluster running for 14 hours. Cost = 1.2 * 10 * 14 = $168
    18. Resize Nodes with Spot Instances. Cost without Spot: 10 node cluster running for 14 hours. Cost = 1.2 * 10 * 14 = $168. Add 10 nodes on Spot: 20 node cluster running for 7 hours. Cost = 1.2 * 10 * 7 = $84 (original nodes) plus 0.6 * 10 * 7 = $42 (Spot nodes)
    19. Resize Nodes with Spot Instances. Cost without Spot: 10 node cluster running for 14 hours. Cost = 1.2 * 10 * 14 = $168. Add 10 nodes on Spot: 20 node cluster running for 7 hours. Cost = 1.2 * 10 * 7 = $84 (original nodes) plus 0.6 * 10 * 7 = $42 (Spot nodes) = Total $126. 25% reduction in price, 50% reduction in time
    20. (1) Ad-Hoc Clusters – What are they? When processing is complete, you can terminate the cluster (and stop paying). (Diagram labels: EMR Cluster, S3)
    21. (1) Ad-Hoc Clusters – When to use: Not using HDFS. Not using the cluster 24/7. Transient jobs. (Diagram labels: EMR Cluster, S3)
    22. (2) “Alive” Clusters – What are they? If you run your jobs 24 x 7, you can also run a persistent cluster and use RI models to save costs. (Diagram labels: EMR, EMR Cluster, S3)
    23. (2) “Alive” Clusters – When? Frequently running jobs. Dependencies on map-reduce-map outputs. (Diagram labels: EMR, EMR Cluster, S3)
    24. (3) S3 instead of HDFS • S3 provides 99.999999999% durability • Elastic • Version control against failure • Run multiple clusters with a single source of truth • Quick recovery from failure • Continuously resize clusters. (Diagram labels: S3, EMR, EMR Cluster)
    25. (4) S3 and HDFS: Load data from S3 using S3DistCp. Benefits of HDFS. Master copy of the data in S3. Get all the benefits of S3. (Diagram labels: S3, EMR, EMR Cluster, HDFS, S3DistCp)
    26. Elastic MapReduce and Redshift. Big Data tools
    27. (1) Reporting Data-warehouse. (Diagram labels: RDBMS, Redshift, OLTP, ERP, Reporting and BI)
    28. (2) Live Archive for (Structured) Big Data. (Diagram labels: DynamoDB, Redshift, OLTP, Web Apps, Reporting and BI)
    29. (3) Cloud ETL for Big Data. (Diagram labels: Redshift, Reporting and BI, Elastic MapReduce, S3)
    30. Comparison table. Columns: Streaming | Hive | Pig | DynamoDB | Redshift. Unstructured Data: ✓ ✓. Structured Data: ✓ ✓ ✓ ✓. Language Support: Any* | HQL | Pig Latin | Client | SQL. SQL: ✓. SQL-Like: ✓. Volume: Unlimited | Unlimited | Unlimited | Relatively Low | 1.6 PB. Latency: Medium | Medium | Medium | Ultra Low | Low
    31. Generation; Collection & storage; Analytics & computation; Collaboration & sharing. Remove Constraints
    32. Scott Crane. Packetloop – Big Data Security Analytics. CEO & Co-founder
    33. Disclaimer and Urban Myth: Customers must make the decision to upload data to Packetloop. We do not transparently intercept customer traffic, nor is it possible within AWS to do this. AWS does not give us access to any other AWS customer traffic.
    34. What is Packetloop? • Big Data Security Analytics • Uses complete data set from the network flow via packet capture • 100% delivered in the Cloud • Instantly available, always up to date • Powerful visualizations • Intuitive to use • Reduces security analysis to minutes
    35. What business problems are we solving? • Security related information is growing exponentially • The current generation of technology is struggling to deliver the intelligence organizations need, and these technologies create friction due to: – Solution complexity – Amount of integration and customization required – Lack of context and fidelity • Threats are becoming more complex, including blended attacks and long-running attacks (spanning months and potentially terabytes of flow data) • Analysts have less time and are forced to be more reactive
    36. Who are we targeting? • Any organization that definitively wants to know exactly what is happening on their networks using information that can be determined in real-time and the information that can be added over time. • Customers that are currently not receiving what was promised by SIEM solutions in terms of analytics, size and scale, fidelity and drill-down capabilities. • Organizations that are already leveraging Cloud providers such as Amazon AWS. • Security consultants, Analysts, Penetration Testers who want to take packet captures and quickly analyze them by uploading to the cloud.
    37. What business challenges did we face? The Vision: • Fastest processing possible • Infinite scale and storage • Global presence • Always be available and up to date • Commodity affordability. The Reality: • Small team of people • Limited capital • Based only in Sydney • Current databases don’t scale the way we needed.
    38. Why choose AWS? • Brand – number 1 in Cloud market • Presence – everywhere we need to be • Availability options – allows us to build in the resilience we need • Flexibility and elasticity – only use what we need and when we need it, whilst supporting limitless horizontal growth • Feature sets – always expanding, allows us to constantly refine our offering • Support – AWS supports our business growth • Cost – low to start with, always improving, easy to understand and predict
    39. What do we use? Cassandra replicates between availability zones. Postgres is Active/Active between availability zones. (Architecture diagram labels: Loop Network VPC; Router; Elastic Load Balancer; ZONE: US-WEST-2a; ZONE: US-WEST-2b; Subnet A/24; Subnet B/24; Subnet C/24; Subnet D/24; NAT to Elastic IPs; NAT to Elastic Network; PgSQL, CASS, CASS, LOOP, IPS, WEB, WEB (shown twice); EMR-1, EMR-N (shown twice))
    40. What do we use? • Elastic MapReduce (EMR) – Hadoop to process jobs to extract security analytics • Cassandra – Patented insertion method for storing security metrics data • PgSQL – user databases, customer settings • IPS – 2 open source and 2 commercial to obtain indicators and warnings • S3 – Packet capture storage, both long term and temporary • VPC – handles replication and active/active traffic between Availability Zones • Elastic Load Balancer – allows us to scale out Web instances as needed • Cloudflare (not shown) – cache and acceleration
    41. What has AWS allowed us to achieve? • Global presence and big company performance • To be the first truly Cloud centric Security Analytics tool • Deliver a revolutionary security analytics tool to any user/analyst on the Internet as a commodity service (charged per GB/per month) • To dynamically change development and architecture direction without worrying about any capital investment we may have already made, and while maintaining a full production instance • Determine exactly what we spend and 100% link it to customer demand • To remain a self funded startup
    42. What’s next? • Shift from batch processing and post hoc analysis to real time processing • Addition of On Premise appliances, Virtual Machines and AMIs to perform local capture, preprocessing and transmission of security metrics to Cloud • Additional modules for analyzing Sessions, Protocols and Files • Move to Probabilistic Threat Analysis using machine learning
    43. Do your own Big Data Security Analytics… • Packetpig is an open source version of our Network Security Analytics toolset available at • Optimised in October 2012 to use AWS Elastic Map Reduce - how to • Configurable scripts to specify what size AWS instances are used for Hadoop, and how many instances are to be spawned to run the mappers and reducers
    44. Thank
    45. Corey Loehr, corey.loehr@intel.com. Executive, Digital Economy Enablement, Intel Australia and New Zealand
    46. Analysis of Data Can Transform Society: Create new business models and improve organizational processes; Enhance scientific understanding, drive innovation, and accelerate; Increase public safety and improve energy efficiency with smart grids.
    47. Democratizing Analytics gets Value out of Big Data: Unlock Value in Silicon; Support Open Platforms; Deliver Software Value
    48. Intel at the Intersection of Big Data. HPC: Enabling exascale computing on massive data sets. Cloud: Helping enterprises build open interoperable clouds. Open Source: Contributing code and fostering ecosystem
    49. Intel at the Heart of the Cloud: Server, Storage, Network
    50. Scale-Out Platform Optimizations for Big Data. Cost-effective performance • Intel® Advanced Vector Extension Technology • Intel® Turbo Boost Technology 2.0 • Intel® Advanced Encryption Standard New Instructions Technology
    51. Intel® Advanced Vector Extensions Technology • Newest in a long line of processor instruction innovations • Increases floating point operations per clock up to 2X¹ performance. ¹ Performance comparison using Linpack benchmark. See backup for configuration details. For more legal information on performance forecasts go to and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
    52. Intel® Turbo Boost Technology 2.0: More Performance. Higher turbo speeds maximize performance for single and multi-threaded applications
    53. Intel® Advanced Encryption Standard New Instructions • Processor assistance for performing AES encryption: 7 new instructions • Makes enabled encryption software faster and stronger
    54. Power of the Platform built by Intel. Richer user experiences. TeraSort for 1 TB sort. (Chart labels: 4 HRS; 50% Reduction; 10 MIN; 80% Reduction; 50% Reduction; 40% Reduction; Intel® Xeon® Processor E5 2600; Solid-State Drive; 10G Ethernet; Intel® Apache Hadoop; Previous Intel® Xeon® Processor)
    55. Virtuous Cycle of Data-Driven Experience: Cloud, Intelligent Systems, Clients
    56. Big Data Analytics
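
The cost arithmetic on slides 17 to 19, and the speaker note that 10 nodes for 10 hours cost the same as 100 nodes for 1 hour, can be reproduced with a few lines of Python. This is only a sketch under the deck's own assumptions: the hourly rates of $1.20 (regular) and $0.60 (Spot) are the illustrative figures used on the slides, not real prices, and doubling the node count is assumed to halve the run time. The function and variable names are hypothetical.

```python
ON_DEMAND_RATE = 1.2  # $ per node-hour, illustrative figure from slides 17-19
SPOT_RATE = 0.6       # $ per node-hour, illustrative Spot figure from slides 18-19


def cluster_cost(nodes: int, hours: int, rate: float) -> float:
    """Cost of running `nodes` instances for `hours` hours at `rate` $/node-hour."""
    return nodes * hours * rate


# Speaker note: the same number of node-hours costs the same either way.
ten_by_ten = cluster_cost(10, 10, ON_DEMAND_RATE)
hundred_by_one = cluster_cost(100, 1, ON_DEMAND_RATE)

# Slide 17: 10 regular nodes for 14 hours.
baseline = cluster_cost(10, 14, ON_DEMAND_RATE)

# Slides 18-19: add 10 Spot nodes, assuming the job then finishes in 7 hours.
with_spot = cluster_cost(10, 7, ON_DEMAND_RATE) + cluster_cost(10, 7, SPOT_RATE)

price_reduction = 1 - with_spot / baseline
time_reduction = 1 - 7 / 14

print(f"Baseline: ${baseline:.0f}, with Spot: ${with_spot:.0f}")
print(f"Price reduction: {price_reduction:.0%}, time reduction: {time_reduction:.0%}")
```

In practice Spot prices fluctuate and Hadoop jobs rarely scale perfectly linearly, so the 25% saving is best read as the idealized case the slides describe.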