Big Data Lessons from the Cloud


Learn about the Challenge of Big Data and how Hadoop in the Cloud, a flexible infrastructure for Big Data, is changing everything!


Published in: Technology
  • MapReduce is a paradigm shift. Google is the poster child. What exactly does Hadoop look like?
  • There are many drivers for Hadoop adoption…
  • Let’s start with this chart, to reinforce that you’re in the right room and picked the right session. Hadoop is not only the fastest-growing Big Data technology; it is one of the fastest-growing technologies, period. Hadoop adoption is happening across industries and across a wide range of application areas. What’s driving this adoption?
  • This is a Hadoop distribution: it includes a series of open-source packages that are tested, hardened, and combined into a complete suite. With MapR we’ve combined this with our own innovations at the data-platform level to make it highly available, dependable, and easier to access and integrate through industry standards like NFS, ODBC, etc.
  • How do you benefit? I mentioned a wide variety of use cases; I’ve generalized these into four groups.
  • The first is expanding data: moving from a sample to all of the transactions. Netflix recommends five movies to you because they look at everybody’s movie watching and ratings and identify clusters of individuals like you. Risk triangles for insurance companies go from the zip-code level down to the neighborhood street. Trading information goes from the last 3 months to 7 years.
  • Let’s look at a specific example…
  • Load CDRs (call detail records) into the data warehouse and transform the data into the proper format for processing and analysis.
  • The problem with this process is that 70% of the EDW load is related to the CDR normalization process. Why is this the case? CDR normalization is difficult within the EDW, and binary extraction and conversion to SQL is difficult.
  • IDEXX (current client, M3 on EMR). IDEXX is the leader in veterinary equipment and also makes software for clinics. They aggregate data from veterinary clinics that run IDEXX software. They ran a MapR cluster internally with 4-5 servers at the time, using it successfully for a few months. Terry went to the AWS conference in November, learned about EMR, tried it out, and liked the flexibility, especially for their use case, where jobs don’t run all the time; for example, one job runs once a week for a few hours. There are 6000 veterinary practices; each night they receive a data extract (a pipe-delimited file) from each one, including all the products sold that day. Hadoop is used for aggregations, then Sqoop loads the results into an Oracle database for the analysts. Now they have another project, processing similar data: creating a list of vaccinations for each animal and providing a portal to end users with all the medical details. This project is compiled with Java 7, uses some Java 7 features, and is part of a much larger project that uses Java 7. Action item (Itay): send them the exact instructions for using Java 7 with MapR/EMR.
  • The first is “simple algorithms and lots of data trump complex models.” This comes from an IEEE article written by three research directors at Google, titled “The Unreasonable Effectiveness of Data.” It was a reaction to an article called “The Unreasonable Effectiveness of Mathematics in the Natural Sciences,” which made the point that simple formulas can explain the complex natural world, the most famous example being E=mc2 in physics. The Google paper noted that economists were jealous, since they lacked similar models to neatly explain human behavior. But in natural language processing, an area notoriously complex that has been studied for years with many AI attempts at addressing it, the authors found that relatively simple approaches on massive data produced stunning results. They cited an example of scene completion: an algorithm is used to eliminate something in a picture, a car for instance, and, based on a corpus of thousands of pictures, fill in the missing background. This algorithm did rather poorly until they increased the corpus to millions of photos; with that amount of data, the same algorithm performed extremely well. While not a direct example from financial services, I think it’s a great analogy. After all, aren’t you looking for an approach that can fill in the missing pieces of a picture or pattern?
  • Okay, interesting graphs, but how does this translate to the real world? Here are some broad examples.
  • Start with the right platform: the power to address your needs and the flexibility to grow with your expansion. Meeting notes (4/3/13 14:27): examples of functionality that makes applications better (custom code, integration, time to market, production grade); RSA security event management uses NFS to pull data easily. 1. Why Hadoop is game-changing: a paradigm shift. 2. How you can benefit: use-case categories, e.g. one customer saved 10 million dollars with predictive analytics. 3. Who MapR is and what we do to make that a reality. End point: what you can do with it to bring value today.
  • Transcript

    • 1. 1 Big Data Lessons from the Cloud Jack Norris, MapR Technologies
    • 2. 2 The Challenge of Big Data: Business Analytics Requires a New Approach. Data volume is growing 44x: from 1.2 zettabytes in 2010 to 35.2 zettabytes in 2020. Data is growing faster than Moore’s Law. Source: IDC Digital Universe Study, sponsored by EMC, May 2010
    • 3. 3 What are the Requirements for Big Data?  Process it quickly  Combine multiple data sources  Expand analysis
    • 4. 4 Big Data in the Cloud  Distributed, scalable computing platform – Data/Compute framework – Commodity hardware  Pioneered at Google  Commercially available as Hadoop
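The MapReduce model behind Hadoop can be sketched with the canonical word-count example. The Python below simulates the map, shuffle, and reduce phases locally, as an illustration only; it is not specific to any Hadoop distribution.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit (word, 1) for every word in one input split
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key across mappers
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data lessons", "big data in the cloud"]
counts = reduce_phase(shuffle(map(map_phase, docs)))
print(counts["big"])  # each word is counted across all documents
```

The appeal of the model is that each phase is embarrassingly parallel: mappers run independently per input split, and reducers run independently per key, which is what lets Hadoop scale out on commodity hardware.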
    • 5. 5 Important Drivers for Hadoop  Data on compute  You don’t need to know what questions to ask beforehand  Simple algorithms on Big Data  Analysis of unstructured data
    • 6. 6 Hadoop Growth
    • 7. 7 Apache Hadoop Distribution  Combination of Various Packages  Integrated, tested and hardened
    • 8. 8 Hadoop in the Cloud
    • 9. 9 Amazon Example: Elastic MapReduce (EMR) EMR provides Hadoop as a Service in the Cloud
    • 10. 10 How does it work? (Diagram: EMR, EMR Cluster, S3.) You can store the data in S3 and/or on the cluster (HDFS). You decide which Hadoop distribution to run, how many nodes, and what types of nodes.
    • 11. 11 How does it work? (Diagram: EMR, EMR Cluster, S3.) You can easily add additional nodes.
    • 12. 12 How does it work? (Diagram: EMR Cluster, S3.) When processing is complete, you can shut down the cluster (and stop paying).
    • 13. 13 Launching a Cluster
    • 14. 14 Thousands of customers, 2 million+ clusters
    • 15. 16 Hadoop in the Cloud is a Flexible Infrastructure for Big Data
    • 16. 17 Cloud Example of Scalability  MinuteSort: the amount of data that can be sorted in 60.00 seconds; the benchmark is technology-agnostic.  The previous record was 1.4 TB, set by Microsoft Research using specially designed software across physical hardware.  The previous Hadoop MinuteSort record was 578 GB.
    • 17. 18 A New MinuteSort World Record: 1.5 TB in 60 seconds, 3x more data processed than the previous Hadoop record.
    • 18. 19 Cloud Deployment Comparison  Previous record: 3452 physical servers (prepare the datacenter, rack and stack servers, maintain hardware): months.  Cloud: 2103 instances (invoke a gcutil command): minutes.
    • 19. 20 Cost Comparison  Previous record: 3452 1U servers x $4K/server = $13,808,000.  Cloud: 2103 n1-standard-4-d instances x $0.58/instance-hour x 60 seconds = $20.33.
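The slide’s cost arithmetic can be checked directly. The sketch below reproduces both figures, assuming the $0.58 instance-hour rate is prorated to the 60 seconds the sort actually ran.

```python
# On-premises: 3452 1U servers at $4,000 each (hardware only)
on_prem = 3452 * 4_000

# Cloud: 2103 n1-standard-4-d instances at $0.58/instance-hour,
# billed here for the 60 seconds (1/60 of an hour) the sort ran
cloud = 2103 * 0.58 * (60 / 3600)

print(f"${on_prem:,}")   # $13,808,000
print(f"${cloud:.2f}")   # $20.33
```

The comparison is deliberately lopsided (hardware purchase price versus one minute of rented compute), but that is exactly the slide’s point: you pay only for the seconds the cluster exists.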
    • 20. 21 Use Case 1: Expand Data for Analysis
    • 21. 22 Comparing an EDW to Hadoop  Major telecom vendor  Key step in billing pipeline handled by data warehouse (EDW)  EDW at maximum capacity  Multiple rounds of software optimization already done  Revenue limiting (= career limiting) bottleneck
    • 22. 23 Transformation Extract and Load CDR billing records Billing reports Data Warehouse Customer bills Original Flow
    • 23. 24 Problem Analysis  70% of EDW load is related to call detail record (CDR) normalization –< 10% of total lines of code –CDR normalization difficult within the EDW –Binary extraction and conversion  Data rates are too high for upstream transform –Requires high volume joins
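As a rough illustration of the kind of normalization work being offloaded, here is a minimal sketch. The pipe-delimited field layout is hypothetical, since the slides don’t show the actual CDR format.

```python
from datetime import datetime

def normalize_cdr(raw_line):
    """Normalize one pipe-delimited call detail record into a row ready
    for warehouse load. Hypothetical layout:
    caller|callee|start_time|duration_seconds"""
    caller, callee, start, duration = raw_line.strip().split("|")
    return {
        "caller": caller.lstrip("+"),                       # canonical number form
        "callee": callee.lstrip("+"),
        "start": datetime.strptime(start, "%Y%m%d%H%M%S"),  # parse compact timestamp
        "minutes": round(int(duration) / 60, 2),            # bill in minutes
    }

record = normalize_cdr("+15551234567|+15559876543|20130403142700|95")
print(record["minutes"])  # 1.58
```

In the offload architecture, logic like this runs as MapReduce jobs over the raw CDR files, so only clean, normalized rows ever reach the EDW.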
    • 24. 25 ETL CDR billing records Billing reports Data Warehouse Customer billing With ETL Offload Hadoop Cluster
    • 25. 26 ETL Offload Hadoop Distribution
    • 26. 27 Simplified Analysis  70% of EDW consumed by ETL processing – Offload frees capacity  EDW direct hardware cost is approximately $30 million vs. Hadoop cluster at 1/50 the cost  Additional EDW only increases capacity by 50% due to poor division of labor
    • 27. 28 The Results  EDW strategy –1.5 x performance –$30 million  Hadoop Strategy –3 x faster –20x cost/performance advantage for Hadoop strategy –With High Availability and data protection
    • 28. 29 Use Case 2: Combine Many Different Data Sources
    • 29. 30 Combining different feeds on one platform Hadoop and HBase Storage and Processing … Real-time data feed from social network Stored in Hadoop Historical Purchase Information Predictive Analytics from Historical data combined with NoSQL querying on real-time social networking data Billing Data
    • 30. 31 Results  New Service Rolled out in 1 quarter  Processing time cut from 20 hours per day to 3  Recommendation engine load time decreased from 8 hours to 3 minutes  Includes data versioning support for easier development and updating of models
    • 31. 32 Collect Data from Dispersed Data Sources
    • 32. 33 Leading Veterinary Equipment Mfgr  Aggregates data across 6000 veterinary clinics  Nightly extracts from each clinic  One job runs once a week for a few hours  Expanding applications to include vaccination analysis for 300M vaccinations  Predictive analytics for disease prevalence and prevention
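The nightly aggregation the speaker notes describe (pipe-delimited extracts of products sold, aggregated in Hadoop before Sqoop loads the totals into Oracle) might look like this in miniature; the record layout and product codes are hypothetical.

```python
from collections import Counter

def aggregate_sales(extract_lines):
    """Aggregate units sold per product across clinic extracts.
    Hypothetical layout: clinic_id|product_code|units_sold"""
    totals = Counter()
    for line in extract_lines:
        clinic, product, units = line.strip().split("|")
        totals[product] += int(units)
    return totals

nightly = [
    "clinic001|VAX-RABIES|3",
    "clinic002|VAX-RABIES|5",
    "clinic002|FLEA-TREAT|2",
]
totals = aggregate_sales(nightly)
print(totals["VAX-RABIES"])  # 8
```

At 6000 clinics the same logic would be expressed as a MapReduce job (product code as the key, units as the value), which is why a cluster that runs only a few hours a week fits the pay-per-use EMR model so well.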
    • 33. 34 Use Case 3: New Application from New Data Source
    • 34. 35 Ancestry.com – Family Tree
    • 35. 36 Overview and Requirements  Collect and Collate information from disparate sources (Text files, Images, etc.)  Leverage new data source: Spit  Machine learning techniques and DNA Matching Algorithms
    • 36. 37 The Results  Storage Infrastructure for billions of small and large files  Blob Store for large images through NoSQL solutions  Multi-tenant capability for data-mining and machine-learning algorithm development
    • 37. 38 Use Case 4: New Analytics on Existing Data
    • 38. 39 Analytic Flexibility  MapReduce enabled Machine learning algorithms  Enhanced Search  Real-time event processing  No need to sample the data Fraud Detection Target Marketing Consumer Behavior Analysis …
    • 39. 40 Hadoop Expands Analytics “Simple algorithms and lots of data trump complex models ” Halevy, Norvig, and Pereira, Google IEEE Intelligent Systems
    • 40. 41 Advanced Simple Analytics  Fraud detection: – Detect small frauds using transaction patterns across the entire portfolio – Identify compromise signature to prevent further exploits and provide solid case explanations  Google Flu Trends vs. Traditional Flu Surveillance systems and modeling  Netflix recommendation engine – Complex models vs. adding IMDB data
    • 41. 42 Combine Them All
    • 42. 43 Clickstream Analysis –  Big Box Retailer came to Razorfish – 3.5 billion records – 71 million unique cookies – 1.7 million targeted ads required per day Problem: Improve Return on Ad Spend (ROAS)
    • 43. 44 Clickstream Analysis – Targeted Ad User recently purchased a sports movie and is searching for video games (1.7 Million per day)
    • 44. 45 Clickstream Analysis – Processing time dropped from 2+ days to 8 hours (with lots more data)
    • 45. 46 Clickstream Analysis – Increased Return On Ad Spend by 500%
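The targeting in this example depends on grouping clickstream events by cookie and testing a campaign rule (here, the slide’s “recently purchased a sports movie and is searching for video games”). A miniature sketch, with an invented event layout for illustration:

```python
from collections import defaultdict

def match_campaign(events):
    """Group clickstream events by cookie and flag cookies that both
    purchased a sports movie and searched for video games.
    Hypothetical event layout: (cookie_id, action, category)"""
    by_cookie = defaultdict(set)
    for cookie, action, category in events:
        by_cookie[cookie].add((action, category))
    return [
        cookie for cookie, acts in by_cookie.items()
        if ("purchase", "sports-movie") in acts
        and ("search", "video-games") in acts
    ]

events = [
    ("c1", "purchase", "sports-movie"),
    ("c1", "search", "video-games"),
    ("c2", "search", "video-games"),
]
print(match_campaign(events))  # ['c1']
```

The group-by-cookie step is the expensive part at 3.5 billion records and 71 million cookies; as a MapReduce shuffle keyed on cookie ID, it parallelizes naturally, which is how the processing time dropped from days to hours.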
    • 46. 47 Hadoop in the Cloud/EMR applications  Targeted advertising / Clickstream analysis  Security: anti-virus, fraud detection, image recognition  Pattern matching / Recommendations  Data warehousing / BI  Bio-informatics (Genome analysis)  Financial simulation (Monte Carlo simulation)  File processing (resize jpegs, video encoding)  Web indexing
    • 47. 48 Big Data Processing  Platform requirements: 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, multi-tenancy.  Workloads: MapReduce, file-based applications, SQL database, search, stream processing.  Batch orientation: enterprise logfile analysis, ETL offload, object archive, fraud detection, clickstream analytics.  Real-time orientation: sensor analysis, “Twitterscraping”, telematics, process optimization.  Interactive orientation: forensic analysis, analytic modeling, BI user focus.
    • 48. 49 Big Data Lessons from the Cloud 1. Big Data requires a new approach 2. Hadoop is a paradigm shift 3. Easy to get started with Hadoop in the Cloud 4. Scale clusters up and down in the Cloud 5. Only pay for what you use 6. Expand data for analysis 7. Combine data sources 8. New application from new data source 9. New analytics 10. Wide variety of applications appropriate for Hadoop
