2. The Challenge of Big Data
Business Analytics Requires a New Approach
[Chart: data volume growing 44x, from 1.2 zettabytes in 2010 to a projected 35.2 zettabytes in 2020]
Data is Growing Faster than Moore’s Law
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
3. What are the Requirements for Big Data?
Process it quickly
Combine multiple data sources
Expand analysis
4. Big Data in the Cloud
Distributed, scalable computing platform
– Data/compute framework
– Commodity hardware
Pioneered at Google
Commercially available as Hadoop
5. Important Drivers for Hadoop
Data and compute on the same nodes
You don’t need to know what questions to ask beforehand
Simple algorithms on Big Data
Analysis of unstructured data
10. How does it work?
[Diagram: Amazon EMR cluster alongside S3]
You can store the data in S3 and/or on the cluster (HDFS)
You decide which Hadoop distribution to run, how many nodes, and what types of nodes
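To make the slide concrete, here is a minimal sketch of launching such a cluster with boto3, the AWS SDK for Python. The cluster name, instance types, node count, release label, and S3 bucket below are illustrative assumptions, not values from the deck; the input data could equally live in S3 or in HDFS on the cluster.

```python
# Sketch: launch an EMR cluster with boto3 (AWS SDK for Python).
# Names, instance types, and counts below are illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="big-data-demo",                      # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",                 # you pick the Hadoop distribution/version
    Applications=[{"Name": "Hadoop"}],
    LogUri="s3://my-bucket/emr-logs/",         # placeholder bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",     # you pick the node types...
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                    # ...and how many nodes
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the work is done
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```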
15. Hadoop in the Cloud is a Flexible Infrastructure for Big Data
16. Cloud Example of Scalability
MinuteSort – the amount of data that can be sorted in 60 seconds
– The benchmark is technology-agnostic
Previous overall record was 1.4 TB, set by Microsoft Research using specially designed software on physical hardware
Previous Hadoop MinuteSort record was 578 GB
17. A New MinuteSort World Record
New world record: 1.5 TB sorted in 60 seconds
3x more data processed than the previous Hadoop record
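Sorting is the shuffle phase’s native job in MapReduce, which is why Hadoop is a natural fit for MinuteSort: mappers only need to emit records keyed by the sort key, and the framework sorts them on the way to the reducers. A minimal, hypothetical Hadoop Streaming mapper is sketched below; an identity reducer such as /bin/cat completes the job, and a globally ordered result across many reducers additionally needs a total-order partitioner.

```python
#!/usr/bin/env python3
# mapper.py - identity mapper for a Hadoop Streaming sort.
# Emits each input line as "key<TAB>rest"; Hadoop's shuffle then sorts
# all records by key before they reach the reducers, so no explicit
# sort code is needed on our side.
import sys

for line in sys.stdin:
    key, _, rest = line.rstrip("\n").partition("\t")
    print(f"{key}\t{rest}")
```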
21. Comparing an EDW to Hadoop
Major telecom vendor
Key step in billing pipeline handled by enterprise data warehouse (EDW)
EDW at maximum capacity
Multiple rounds of software optimization already done
Revenue-limiting (= career-limiting) bottleneck
23. Problem Analysis
70% of EDW load is related to call detail record (CDR) normalization
– Less than 10% of total lines of code
– CDR normalization is difficult within the EDW
– Binary extraction and conversion
Data rates are too high for an upstream transform
– Requires high-volume joins
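The deck doesn’t show the normalization code itself, but the shape of the job is easy to sketch. Below is a hypothetical Python normalizer for fixed-width binary CDRs; the 24-byte record layout and field names are invented for illustration, since real CDR formats are vendor-specific.

```python
# Hypothetical CDR normalizer: unpack binary call detail records into
# tab-separated text that Hadoop (or the EDW) can process downstream.
# The record layout below is invented; real CDR formats are proprietary.
import struct
from datetime import datetime, timezone

CDR_FORMAT = "!10sIIHI"  # caller (10 bytes), callee id, start epoch, duration (s), cell id
CDR_SIZE = struct.calcsize(CDR_FORMAT)  # 24 bytes

def normalize_cdr(record: bytes) -> str:
    caller, callee, start, duration, cell = struct.unpack(CDR_FORMAT, record)
    started = datetime.fromtimestamp(start, tz=timezone.utc).isoformat()
    return "\t".join([caller.decode("ascii").rstrip("\x00"),
                      str(callee), started, str(duration), str(cell)])

with open("cdrs.bin", "rb") as f:  # placeholder input file
    while (chunk := f.read(CDR_SIZE)) and len(chunk) == CDR_SIZE:
        print(normalize_cdr(chunk))
```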
26. Simplified Analysis
70% of EDW capacity is consumed by ETL processing – offloading it frees that capacity
EDW direct hardware cost is approximately $30 million, vs. a Hadoop cluster at 1/50 the cost (roughly $600,000)
Adding another EDW only increases capacity by 50% due to the poor division of labor
27. The Results
EDW strategy
– 1.5x performance
– $30 million
Hadoop strategy
– 3x faster
– 20x cost/performance advantage for the Hadoop strategy
– With high availability and data protection
29. Combining different feeds on one platform
Hadoop and HBase for storage and processing
[Diagram: real-time data feed from a social network stored in Hadoop, alongside historical purchase information and billing data]
Predictive analytics from historical data, combined with NoSQL querying on real-time social-networking data
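As a rough illustration of the pattern on this slide, here is a sketch using happybase, a Python HBase client over Thrift. The host, table, row-key scheme, and column names are placeholders I have invented, not details from the actual deployment.

```python
# Sketch: one platform serving real-time writes and NoSQL queries.
# Table and column names below are invented placeholders.
import happybase

conn = happybase.Connection("hbase-thrift-host")   # placeholder host
events = conn.table("social_events")               # hypothetical table

# Real-time feed: write the latest social-network event for a user.
events.put(b"user42|2013-04-03T14:27:00", {b"e:mention": b"product-123"})

# NoSQL query: scan that user's recent events to combine with the
# predictive model built offline from historical purchase data.
for key, data in events.scan(row_prefix=b"user42|"):
    print(key, data)
```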
30. Results
New service rolled out in one quarter
Processing time cut from 20 hours per day to 3
Recommendation-engine load time decreased from 8 hours to 3 minutes
Includes data-versioning support for easier development and updating of models
32. Leading Veterinary Equipment Manufacturer
Aggregates data across 6,000 veterinary clinics
Nightly extracts from each clinic
One job runs once a week for a few hours
Expanding applications to include vaccination analysis for 300 million vaccinations
Predictive analytics for disease prevalence and prevention
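The speaker notes describe nightly pipe-delimited extracts that Hadoop aggregates before Sqoop loads the results into Oracle. A hypothetical Hadoop Streaming reducer for that kind of aggregation is sketched below; the field layout is assumed, not taken from IDEXX’s actual files.

```python
#!/usr/bin/env python3
# reducer.py - sums units sold per product across all clinic extracts.
# Assumes a mapper has already emitted "product<TAB>units" lines, which
# Hadoop delivers here grouped and sorted by product.
import sys

current, total = None, 0
for line in sys.stdin:
    product, _, units = line.rstrip("\n").partition("\t")
    if product != current and current is not None:
        print(f"{current}\t{total}")
        total = 0
    current = product
    total += int(units)
if current is not None:
    print(f"{current}\t{total}")
```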
35. Overview and Requirements
Collect and collate information from disparate sources (text files, images, etc.)
Leverage a new data source: spit (saliva samples for DNA analysis)
Machine learning techniques and DNA-matching algorithms
36. The Results
Storage infrastructure for billions of small and large files
Blob store for large images through NoSQL solutions
Multi-tenant capability for data-mining and machine-learning algorithm development
38. Analytic Flexibility
MapReduce-enabled machine learning algorithms
Enhanced search
Real-time event processing
No need to sample the data
Applications: fraud detection, target marketing, consumer behavior analysis, …
39. Hadoop Expands Analytics
“Simple algorithms and lots of data trump complex models”
– Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
40. Advanced Simple Analytics
Fraud detection (see the sketch below):
– Detect small frauds using transaction patterns across the entire portfolio
– Identify a compromise signature to prevent further exploits and provide solid case explanations
Google Flu Trends vs. traditional flu-surveillance systems and modeling
Netflix recommendation engine
– Complex models vs. simply adding IMDB data
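As a toy version of the fraud-detection bullet above, the sketch below flags merchants where many distinct accounts see tiny “test” charges, a classic signature of a compromised card batch. The thresholds and field layout are invented for illustration; the point is the slide’s thesis that a simple rule over the entire portfolio can work where sampled data cannot.

```python
# Toy "simple algorithm on all the data": flag merchants where many
# distinct accounts show tiny charges. Thresholds are invented.
from collections import defaultdict

SMALL_CHARGE = 2.00   # dollars; fraud "test" charges are typically tiny
BURST = 3             # distinct accounts hitting one merchant (toy threshold)

def find_compromised_merchants(transactions):
    """transactions: iterable of (account_id, merchant_id, amount)."""
    accounts = defaultdict(set)
    for account, merchant, amount in transactions:
        if amount <= SMALL_CHARGE:
            accounts[merchant].add(account)
    return {m for m, accts in accounts.items() if len(accts) >= BURST}

sample = [("a1", "m9", 1.00), ("a2", "m9", 0.50),
          ("a3", "m9", 1.25), ("a4", "m2", 45.00)]
print(find_compromised_merchants(sample))  # -> {'m9'}
```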
42. Clickstream Analysis
Big-box retailer came to Razorfish
– 3.5 billion records
– 71 million unique cookies
– 1.7 million targeted ads required per day
Problem: improve Return on Ad Spend (ROAS)
48. Big Data Lessons from the Cloud
Big Data Lessons from the Cloud
1. Big Data requires a new approach
2. Hadoop is a paradigm shift
3. Easy to get started with Hadoop in the Cloud
4. Scale clusters up and down in the Cloud
5. Only pay for what you use
6. Expand data for analysis
7. Combine data sources
8. New application from new data source
9. New analytics
10. Wide variety of applications appropriate for Hadoop
Editor's Notes
MapReduce is a paradigm shift. Google is the poster child. What exactly does Hadoop look like?
There are many drivers for Hadoop adoption…
Let’s start with this chart, to reinforce that you’re in the right room and picked the right session… Hadoop is not only the fastest-growing Big Data technology, it is one of the fastest-growing technologies, period. Hadoop adoption is happening across industries and across a wide range of application areas. What’s driving this adoption?
This is a Hadoop distribution: it includes a series of open-source packages that are tested, hardened, and combined into a complete suite. With MapR we’ve combined this with our own innovations at the data-platform level to make it highly available, dependable, and easier to access and integrate through industry standards like NFS, ODBC, etc.
How do you benefit? I mentioned a wide variety of use cases… I’ve generalized these into four groups. The first
is expanding data: from sampled transactions to all of the transactions… Netflix recommends five movies to you because they look at everybody’s movie watching and ratings and identify clusters of individuals like you… Risk triangles for insurance companies go from the zip-code level down to the neighborhood street… Trading information goes from the last 3 months to 7 years…
Let’s look at a specific example…
Load CDRs (call detail records) into the data warehouse and transform the data into the proper format for processing and analysis…
The problem with this process is that 70% of the EDW load is related to the CDR normalization process. AI: Why is this the case? CDR normalization is difficult within the EDW; binary extraction and conversion to SQL is difficult.
IDEXX (current client, M3 on EMR). IDEXX is the leader in veterinary equipment and also makes software for clinics, etc. They aggregate data from veterinary clinics that have IDEXX software. They ran a MapR cluster internally with 4-5 servers at the time, using it successfully for a few months. Terry went to the AWS conference in November and learned about EMR. He tried it out and liked the flexibility, especially in their use case where there aren’t jobs running all the time. Example: one job runs once a week for a few hours. 6,000 veterinary practices. Each night they receive a data extract from each one (a pipe-delimited file) that includes all the products sold that day. Hadoop is used for aggregations; Sqoop then loads the results into an Oracle database for the analysts. Now they have another project. It is compiled with Java 7 and uses some Java 7 features (and it’s part of a much larger project that uses Java 7). AI Itay: Send them the exact instructions for using Java 7 with MapR/EMR. It processes data similar to the first project; in this case, they are creating a list of vaccinations for each animal and providing a portal to end users with all the medical details.
The first is “simple algorithms and lots of data trump complex models.” This comes from an IEEE article written by three research directors at Google. The article was titled “The Unreasonable Effectiveness of Data”; it was a reaction to an article called “The Unreasonable Effectiveness of Mathematics in the Natural Sciences.” That paper made the point that simple formulas can explain the complex natural world, the most famous example being E = mc² in physics. The Google authors talked about how economists were jealous, since they lacked similar models to neatly explain human behavior. But they found that in natural language processing, an area notoriously complex that has been studied for years with many AI attempts at addressing it, relatively simple approaches on massive data produced stunning results. They cited an example of scene completion: an algorithm is used to eliminate something in a picture, a car for instance, and, based on a corpus of thousands of pictures, fill in the missing background. This algorithm did rather poorly until the corpus grew to millions of photos, and with that amount of data the same algorithm performed extremely well. While not a direct example from financial services, I think it’s a great analogy. After all, aren’t you looking for an approach that can fill in the missing pieces of a picture or pattern?
Okay, interesting graphs, but how does this translate to the real world? Here are some broad examples.
Start with the right platform… power to address your needs and the flexibility to grow with your expansion. ----- Meeting Notes (4/3/13 14:27) ----- Examples of functionality that makes applications better: custom code, integration, time to market, production grade. RSA - security event management - NFS - pull data easily. 1. Why Hadoop is game-changing - paradigm shift. 2. How can you benefit - use-case categories… saved 10 million dollars - predictive analytics. Need money. Who is MapR, and what do we do to make that a reality. End point - what you can do with it to bring value today.