Big Data and the Cloud: A Best Friend Story



Joe Ziegler's Presentation to the IDG Big Data World Conference in Seoul.

Published in: Technology
  • The more misspelled words you collect, the better your spellcheck application becomes.
  • Data volume: as the data volume increases, it becomes increasingly difficult to process the data. It is easy on one box and harder across many boxes, once the data exceeds the capacity of one place. Data structure: data comes in a variety of formats, from log files to database schemas to images, and the diversity of data structures and formats keeps growing. To analyze this data holistically, you have to consolidate data across multiple data sources and multiple formats. Since valuable data comes from various companies such as Facebook and LinkedIn, you also have to consolidate data across businesses.
  • According to IDC, 95% of the 1.2 zettabytes of data in the digital universe is unstructured, and 70% of this is user-generated content. Unstructured data is also projected for explosive growth, with estimates of compound annual growth (CAGR) at 62% from 2008 to 2012. Challenges: unconstrained growth.
  • Finally, complexity increases because demands on data are changing. Business requires faster response times on fresher data. Sampling is not good enough; history is important. Did the customer purchase something in February because his friend has a birthday, or because it was Valentine's Day? This information can help figure out how to help this customer next February. SQL is simply not enough to drive some of the answers. Data scientists require access to other statistical tools or other programming languages. Finally, and most importantly, users demand inexpensive experimentation. Often we don’t know what products or facts will come out of our analytics, so we cannot justify a large upfront investment.
  • Computers typically generate data as a byproduct of interacting with people or with other devices. The more interactions, the more data there typically is. This data comes in a variety of formats, from semi-structured logs to unstructured binaries. This data can be extremely valuable. It can be used to understand and track application or service behavior so that we can find errors or suboptimal user experience. We can mine it for patterns and correlations to generate recommendations. For example, ecommerce sites can analyze user access logs to provide product recommendations, social networking sites provide new friend recommendations, dating sites find qualified soul mates, and so forth.
  • Big data is important.
  • Now the philosophy around data has changed. The philosophy is to collect as much data as possible before you know what questions you are going to ask. Most importantly, you don't know which algorithms you are going to use, because you don't know what type of questions you might need to answer in the future. The ultimate mantra is: collect and measure everything. How you are going to refine those algorithms, how much data, how much processing power: you really don't know how many resources you need. Big data is what clouds are for. Big data analysis and cloud computing are the perfect marriage. Free of constraints: collect and store without limits; compute and analyze without limits; visualize without limits.
  • Data is the next industrial revolution. Today, the core of any successful company is the data it manages and its ability to effectively model, analyze and process that data quickly, almost in real time, so that it can make the right decision faster and rise to the top.
  • These resources are even more precious because of the rarity of skills.
  • Our goal, and what our customers tell us they see, is that this ratio is inverted after moving to AWS. When you move your infrastructure to the cloud, this changes things drastically. Only 30% of your time should be spent architecting for the cloud and configuring your assets. This gives you 70% of your time to focus on your business. Project teams are free to add value to the business and its customers, to innovate more quickly, and to deliver products to market quickly as well.
  • The new model is to collect as much data as possible: the “Data-First Philosophy”. It allows us to collect data and ask questions later, and to ask many different kinds of questions.
  • There are many patterns of usage that make capacity planning a complex science. From on and off usage patterns, where capacity is only needed at fixed times and not at others, fast growth where an online service becomes so successful that step changes in traditional capacity need to be added, variable peaks - where you just don't know what demand will be when and best guess applies, to predictable peaks such as during commute times as customers use mobile devices to access your service.
  • Each of these examples is typified by wasted IT resources. Where you planned correctly, the IT resources will be over-provisioned so that services are not impacted and customers are not lost during high demand. In the worst cases, that capacity will not be enough, and customer dissatisfaction will result. Most businesses have a mix of differing patterns at play, and much time and resource is dedicated to planning and management to ensure services are always available. And when a new online service is really successful, you often can't ship in new capacity fast enough. Some say that's a nice problem to have, but those who have lived through it will tell you otherwise!
  • Elasticity with AWS enables your provisioned capacity to follow demand. To scale up when needed and down when not. And as you only pay for what is used, the savings can be significant.
  • You control how and when your service scales, so you can closely match increasing load in small increments, scale up fast when needed, and cool off and reduce the resources being used at any time of day. Even the most variable and complex demand patterns can be matched with the right amount of capacity - all automatically handled by AWS.
  • Horizontal scaling (scale-out) on commodity hardware. Perfect for Hadoop.
  • Elasticity works from just 1 EC2 instance to many thousands. Just dial up and down as required.
  • This is supported on the AWS cloud via Amazon Elastic MapReduce (EMR), its managed Hadoop service. The EMR team’s reason for living is making Hadoop, and Big Data processing, just work in the cloud. Over the last year this has led to over 2 million clusters being run on the platform by thousands of paying customers. The EMR team is also focused on ensuring that Hadoop integrates seamlessly with other AWS services, not only supporting Amazon S3 as a file system but also integrating with CloudWatch (CW), our cloud-based monitoring service, and DynamoDB, our managed NoSQL offering.
  • Netflix runs a persistent SLA-driven prod cluster to generate summary data and aggregate reports each day from the streaming data. The raw log data is streamed directly into the cluster from Amazon S3 with only intermediate data stored on HDFS on the cluster.
  • The processed data is then streamed back into Amazon S3 where it is accessible by other teams including personalization/recommendation services.
  • The processed data is then streamed back into Amazon S3 where it is accessible by other teams including personalization/recommendation services and to analysts through a real-time custom visualization tool called Sting.
  • Netflix also uses a wide range of languages for data processing, including Pig for ETL, Hive for SQL-driven analytics, Python for streaming jobs, and Java MapReduce.
  • And scale is something AWS is used to dealing with. The Amazon Simple Storage Service, S3, recently passed 1 trillion objects in storage, with a peak transaction rate of 650 thousand per second. That's a lot of objects, all stored with 11 9's of durability.
  • And just like an electricity grid, where you would not wire every factory to the same power station, the AWS infrastructure is global, with multiple regions around the globe from which services are available. This means you have control over things like where your applications run, where your data is stored, and where best to serve your customers from.
  • Based on 15 years of experience. DynamoDB originates from Dynamo, the NoSQL solution used in the ecommerce side of the business; this original NoSQL solution is described in a paper we released in 2007, which is freely available. … or API to define tables; we take care of provisioning and durability. Solid State Disks. You define how much you wish to reserve for reads and writes. DynamoDB will reserve the necessary machine resources to meet your throughput needs while ensuring consistent, low-latency performance. Can raise default limits.
  • Netflix streams 8 TB of data into the cloud per day. This is collected, aggregated, and pushed to Amazon S3 via a fleet of EC2 servers running Apache Chukwa.
  • This is supplemented with legacy data, such as customer service info, from Netflix’s on-premise data center.
  • Low latency access to customer dimension data is served from a Cassandra deployment in the cloud.
  • How do you efficiently, and cost effectively, analyze all of that data?
  • Global reach (North Pole, Space). Native app on every smartphone, SMS, web, mobile web. 10M+ users, 15M+ venues, ~1B check-ins. Terabytes of log data.
  • Bank: at least 400,000 simulations to get realistic results. 23 hours to 20 minutes: dramatically reduced processing time, with the ability to reduce it even further when required. Bankinter uses Amazon Web Services (AWS) as an integral part of their credit-risk simulation application, developing complex algorithms to simulate diverse scenarios in order to evaluate the financial health of their clients. “This requires high computational power,” says Bankinter Director of New Technologies Pedro Castillo. “We need to execute at least 400,000 simulations to get realistic results.”
  • One result of such experimentation is Taste Test, a recommendations product that helps Etsy figure out your tastes and offer you relevant products. It works like this: you see 6 images at a time and you pick the image you like the most. You iterate through these sets of images a few times (you can also skip a set if you don’t like any of the images), and after a few iterations Etsy displays the products that are most relevant to you. I encourage you to try it; it’s a lot of fun. Today, Etsy uses Amazon Elastic MapReduce for web log analysis and recommendation algorithms. Because AWS easily and economically processes enormous amounts of data, it’s ideal for the type of processing that Etsy performs. Etsy copies its HTTP server logs every hour to Amazon S3, and syncs snapshots of the production database on a nightly basis. The combination of Amazon’s products and Etsy’s syncing/storage operation provides substantial benefits for Etsy. As Dr. Jason Davis, lead scientist at Etsy, explains, “the computing power available with [Amazon Elastic MapReduce] allows us to run these operations over dozens or even hundreds of machines without the need for owning the hardware.” Dr. Davis goes on to say, “Amazon Elastic MapReduce enables us to focus on developing our Hadoop-based analysis stack without worrying about the underlying infrastructure. As our cycles shift between development and research, our software and analysis requirements change and expand constantly, and [Amazon Elastic MapReduce] effectively eliminates half of our scaling issues, allowing us to focus on what is most important.” Etsy has realized improved results and performance by architecting their application for the cloud, with robustness and fault tolerance in mind, while providing a market for users to buy and sell handmade items online.
  • Another example of such innovation is gift ideas. A lot of us struggle to pick the right present for our friends, so Etsy has a product that makes it easier. Etsy looks at your Facebook social graph and learns about your interests and those of your friends. It uses this information to give you ideas for presents. For example, if your friend is an REM fan, Etsy may suggest a t-shirt with an REM print on it. These innovative data products are just a few examples of the innovation that is possible if we lower the cost barriers for data experimentation.
  • Yelp is also doing product recommendations based on location, people's reviews, or people's searches. For example, the “people who viewed this, viewed that” feature can help customers discover other relevant options in the area. People can discover interesting facts about places with the “People viewed this after searching for that” feature. In this example, the Westin hotel probably has glass elevators and likely offers the best location to stay in San Francisco, at least by some definition of best. There is also a “review highlights” feature. Yelp analyzes written reviews and provides highlights about the places, so that their customers don’t have to read through all the reviews to get a basic idea of the place. All these differentiating features were possible because of Hadoop and a flexible infrastructure for data processing.
  • 500% increase in returns for advertising. Petabytes of storage. The retail business has a lot of data about its users; it has just never used it in advertising. For example, the retailer knows that the customer has purchased a sports movie and is currently searching for video games, so it may make sense to advertise a sports video game to the customer. Efficient: elastic infrastructure from AWS allows capacity to be provisioned as needed based on load, reducing cost and the risk of processing delays. Amazon Elastic MapReduce and Cascading let Razorfish focus on application development without having to worry about time-consuming set-up, management, or tuning of Hadoop clusters or the compute capacity upon which they sit. Ease of integration: Amazon Elastic MapReduce with Cascading allows data processing in the cloud without any changes to the underlying algorithms. Flexible: Hadoop with Cascading is flexible enough to allow “agile” implementation and unit testing of sophisticated algorithms. Adaptable: Cascading simplifies the integration of Hadoop with external ad systems. Scalable: AWS infrastructure helps Razorfish reliably store and process huge (petabyte-scale) data sets. The AWS elastic infrastructure platform allows Razorfish to manage wide variability in load by provisioning and removing capacity as needed. Mark Taylor, Program Director at Razorfish, said, “With our implementation of Amazon Elastic MapReduce and Cascading, there was no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired. We completed development and testing of our first client project in six weeks. Our process is completely automated. Total cost of the infrastructure averages around $13,000 per month. Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before.”
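The capacity-planning notes above contrast fixed provisioning, which either wastes resources or falls short of demand, with elastic capacity that follows demand. A minimal Python sketch of that comparison follows. The demand figures, the function names, and the increment size are all hypothetical illustrations, not AWS APIs or measured data.

```python
def fixed_capacity_outcome(demand, capacity):
    """Return (wasted, unmet) unit-hours when capacity never changes."""
    wasted = sum(max(capacity - d, 0) for d in demand)
    unmet = sum(max(d - capacity, 0) for d in demand)
    return wasted, unmet


def elastic_capacity_outcome(demand, step=1):
    """Return (wasted, unmet) when capacity follows demand in increments of `step`.

    Each hour, capacity is rounded up to the next multiple of `step`,
    so demand is always met and only the rounding remainder is wasted.
    """
    provisioned = [-(-d // step) * step for d in demand]  # integer ceiling
    wasted = sum(p - d for p, d in zip(provisioned, demand))
    return wasted, 0


# Hypothetical hourly demand with a single sharp peak (units of capacity).
hourly_demand = [2, 2, 3, 10, 25, 10, 3, 2]

print(fixed_capacity_outcome(hourly_demand, capacity=25))  # sized for the peak
print(fixed_capacity_outcome(hourly_demand, capacity=10))  # sized below the peak
print(elastic_capacity_outcome(hourly_demand, step=5))     # follows demand
```

Sizing for the peak avoids unmet demand at the cost of idle capacity, sizing below the peak trades waste for customer dissatisfaction, and the elastic case keeps both small. Real scaling policies also have start-up delays, which this sketch ignores.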

    1. Amazon Web Services. Big Data and the Cloud: A Best Friend Story
    2. Joe Ziegler, Technical @jiyosub
    3. Joe Ziegler, Technical Evangelist @jiyosub
    4. Characteristics of Big Data; How the Cloud Is Big Data’s Best Friend; Big Data on the Cloud In the Real World
    5. Characteristics of Big Data
    6. BIG DATA: When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share it
    7. Bigger Data is Better Data
    8. Features driven by MapReduce
    9. Bigger Data is Harder Data
    10. Big Data is Getting Bigger: unconstrained data growth. 95% of the 1.2 zettabytes of data in the digital universe is unstructured. 70% of this is user-generated content. Unstructured data growth is explosive, with estimates of compound annual growth (CAGR) at 62% from 2008 to 2012. Source: IDC. [Chart scale: GB, TB, PB, EB, ZB]
    11. Big Data is Hard and getting harder. Changing data requirements: faster response time on fresher data; sampling is not good enough and history is important; increasing complexity of analytics; users demand inexpensive experimentation
    12. Where is it Coming From? Computer Generated: application server logs (web sites, games); sensor data (weather, water, smart grids); images/videos (traffic, security cameras). Human Generated: Twitter “Fire Hose” 50m tweets/day, 1,400% growth per year; blogs/reviews/emails/pictures; social graphs: Facebook, Linked-in, Contacts
    13. Storage, Big Data, Compute: How quick do you need to read it? Data has gravity [Diagram: App, Data, App]
    14. Storage, Big Data, Compute: How quick do you need to read it? …and inertia at volume… [Diagram: Data]
    15. Storage, Big Data, Compute: How quick do you need to read it? …easier to move applications to the data [Diagram: Data]
    16. The Role of Data is Changing
    17. Until now, the questions you asked drove the data model. The new model is to collect as much data as possible: the “Data-First Philosophy”
    18. Data is the new raw material for any business, on par with capital, people, labor
    19. We Need Tools Built Specifically for Big Data
    20. Hadoop: Scale out Easily; Parallel Computing; Commodity Hardware. Solves some Problems; Complex to Run; Special Skills to Maintain
    21. How the Cloud Is Big Data’s Best Friend
    22. How do we define the cloud? By Benefits!
    23. Cloud: No CapEx; Pay Per Use; Elasticity; Fast Time to Market; Focus on core competency
    24. Why is the Cloud Big Data’s Best Friend
    25. We know we want to collect, store, organize, analyze and share it. But we have limited resources.
    26. The Cloud Optimizes Precious IT Resources, i.e. Skilled People
    27. “Over the next decade, the number of files or containers that encapsulate the information in the digital universe will grow by 75x, while the pool of IT staff available to manage them will grow only slightly, at 1.5x.” - 2011 IDC Digital Universe Study
    28. Deploying a Hadoop cluster is hard
    29. The Old IT World vs. Cloud computing. The Old IT World: 30% Using Big Data; 70% Managing All of the “Undifferentiated Heavy Lifting”
    30. The Old IT World vs. Cloud computing. The Old IT World: 30% Using Big Data; 70% Managing All of the “Undifferentiated Heavy Lifting”. Cloud computing: 70% Analyzing and Using Big Data; 30% Configuring Cloud-Based Infrastructure Assets
    31. The Cloud Reduces Cost For Experimentation
    32. Reusability; Managed Services; Scale; Innovation
    33. Reusability; Managed Services; Scale; Innovation
    34. Reusability; Managed Services; Scale; Innovation
    35. Reusability; Managed Services; Scale; Innovation
    36. Reusability; Managed Services; Scale; Innovation
    37. The Cloud Optimizes Capacity Resources
    38. Elastic Compute Capacity: On and Off; Fast Growth; Variable peaks; Predictable peaks
    39. Elastic Compute Capacity: On and Off; Fast Growth; Variable peaks; Predictable peaks. WASTE. CUSTOMER DISSATISFACTION
    40. Elastic Compute Capacity [Chart: capacity over time, comparing traditional IT capacity, elastic cloud capacity, and your IT needs]
    41. Elastic Compute Capacity: On and Off; Fast Growth; Variable peaks; Predictable peaks
    42. The Cloud Empowers Users to Balance Cost and Time
    43. 1 instance for 500 hours = 500 instances for 1 hour
    44. Storage, Big Data, Compute: From one instance… How quick do you need to read it?
    45. Storage, Big Data, Compute: …to thousands. How quick do you need to read it?
    46. The Cloud Scales
    47. AMAZON ELASTIC MAPREDUCE: Managed Hadoop offering in the cloud; Integration with other AWS services; Thousands of customers ran over 2 million clusters on EMR over the last year
    48. [Diagram: S3, Prod Cluster (EMR), HDFS] Data streamed directly from S3 to the cluster
    49. [Diagram: S3, Prod Cluster (EMR), HDFS] Results streamed back to S3
    50. [Diagram: S3, Prod Cluster (EMR), Recommendation Engine, Ad-hoc Analysis, Personalization] Data consumed in multiple ways
    51. [Diagram: S3, Prod Cluster (EMR)] Wide range of processing languages used
    52. The Cloud Enables Collection and Storage of Big Data
    53. Simple Storage Service: 1 trillion objects stored; 650k+ peak transactions per second
    54. Global Accessibility. Regions: US-WEST (N. California); US-WEST (Oregon); US-EAST (Virginia); GOV CLOUD; EU-WEST (Ireland); ASIA PAC (Tokyo); ASIA PAC (Singapore); SOUTH AMERICA (Sao Paulo)
    55. Amazon DynamoDB: DynamoDB is a fully managed NoSQL database service that provides extremely fast and predictable performance with seamless scalability. Zero Administration; Low Latency SSDs; Reserved Capacity; Unlimited Potential Storage and Throughput
    56. The Cloud Enables Processing
    57. We know we want to collect, store, organize, analyze and share it. But we have limited resources.
    58. Big Data on the Cloud In the Real World
    59. Big Data Verticals. Social Network/Gaming: User Demographics, Usage Analysis, In-game Metrics. Media/Advertising: Targeted Advertising, Image and Video Processing. Financial Services: Monte Carlo Simulations, Risk Analysis. Oil & Gas: Seismic Analysis. Retail: Recommendations, Transactions Analysis. Life Sciences: Genome Analysis. Security: Anti-virus, Fraud Detection, Image Recognition
    60. [Diagram: Netflix Web Services (Honu), S3] 8 TB of event data per day
    61. [Diagram: Netflix Data Center, Legacy Data, S3] Legacy data from on-premise data center
    62. Customer dimension data stored in Cassandra
    63. ~1 PB of data stored in Amazon S3
    64. Visualizations
    65. Bank – Monte Carlo Simulations: 23 Hours to 20 Minutes. “The AWS platform was a good fit for its unlimited and flexible computational power to our risk-simulation process requirements. With AWS, we now have the power to decide how fast we want to obtain simulation results, and, more importantly, we have the ability to run simulations not possible before due to the large amount of infrastructure required.” – Castillo, Director, Bankinter
    66. Recommendations: The Taste Test
    67. Recommendations: Gift Ideas for Facebook Friends
    68. Click Stream Analysis: User recently purchased a sports movie and is searching for video games. Targeted Ad (1.7 Million per day)
    69. Characteristics of Big Data; How the Cloud Is Big Data’s Best Friend; Big Data on the Cloud In the Real World
    70. Questions?
    71. Joe Ziegler, Technical Evangelist @jiyosub
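
Slide 43's equivalence (1 instance for 500 hours = 500 instances for 1 hour) can be sketched as simple arithmetic in Python. This is a rough illustration under stated assumptions: the workload is perfectly parallel, billing is per instance-hour, and the price used is a hypothetical placeholder rather than an actual AWS rate. The function name `job_plan` is likewise made up for this example.

```python
def job_plan(total_instance_hours, instances, price_per_instance_hour):
    """Return (wall_clock_hours, cost) for a perfectly parallel job.

    Assumes the work divides evenly across instances and that each
    instance is billed only for the hours it actually runs.
    """
    wall_clock_hours = total_instance_hours / instances
    cost = instances * wall_clock_hours * price_per_instance_hour
    return wall_clock_hours, cost


PRICE = 0.10  # hypothetical price per instance-hour, not a real AWS rate

for n in (1, 50, 500):
    hours, cost = job_plan(500, n, PRICE)
    print(f"{n:>3} instances: {hours:g} hours of wall-clock time, cost {cost:.2f}")
```

Under these assumptions the cost is the same in every row and only the time to result changes, which is the balance of cost and time that slide 42 refers to. Real jobs have serial portions and start-up overhead, so the equivalence is an approximation.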