31. Customer Risk Analysis
Build comprehensive data picture of customer side risk
Publish a consolidated set of attributes for analysis
Map ratings across products
Parse and aggregate data from different sources
Credit and debit cards, product payments, deposits and savings
Banking activity, browsing behavior, call logs, e-mails and chats
Merge data into a single view
A “fuzzy join” among data sources
Structure and normalize attributes
Sentiment analysis, pattern recognition
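The "fuzzy join" across sources can be sketched as an approximate match on a normalized customer attribute. This is a minimal sketch, assuming name-keyed records and a hand-picked similarity threshold, not a production entity-resolution pipeline:

```python
from difflib import SequenceMatcher

def normalize(name):
    # Structure/normalize the attribute before matching.
    return " ".join(name.lower().replace(".", "").split())

def fuzzy_join(left, right, threshold=0.85):
    # Merge records from two sources when their normalized names
    # are similar enough; a toy stand-in for entity resolution.
    joined = []
    for l in left:
        for r in right:
            score = SequenceMatcher(
                None, normalize(l["name"]), normalize(r["name"])).ratio()
            if score >= threshold:
                merged = dict(l)
                merged.update(r)
                joined.append(merged)
    return joined

# Hypothetical records from two product systems.
cards = [{"name": "J. Q. Public", "card_balance": 1200}]
deposits = [{"name": "jq public", "savings": 5000}]
single_view = fuzzy_join(cards, deposits)
```

A real pipeline would block candidate pairs first rather than compare all records pairwise.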
Copyright 2010 Cloudera Inc. All rights reserved
32. Surveillance and Fraud Detection
Trade surveillance records activity in a central repository
Centralized logging across all execution platforms
Structured and raw log data from multiple applications
Pattern recognition detects anomalies and harmful behavior
Feature set and timeline vector are very dynamic
Schema on read provides flexibility for analysis
Data is primarily served and processed in HDFS with MR
Data filtering and projection in Pig and Hive
Statistical modeling of data sets in R or SAS
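Schema on read means each raw record is interpreted at analysis time rather than at load time. A sketch of that idea, assuming two hypothetical log formats (JSON events and CSV rows) mixed in the same files:

```python
import csv
import io
import json

def parse_record(raw):
    # Interpret the raw line only now, at read time: JSON events and
    # CSV rows coexist in the same input. Field layout is assumed.
    raw = raw.strip()
    if raw.startswith("{"):
        rec = json.loads(raw)
        return rec["trader"], rec["symbol"], float(rec["qty"])
    trader, symbol, qty = next(csv.reader(io.StringIO(raw)))
    return trader, symbol, float(qty)

raw_lines = [
    '{"trader": "t1", "symbol": "XYZ", "qty": 100}',
    "t2,XYZ,250",
]
records = [parse_record(line) for line in raw_lines]
```

In the stack described above, this parsing would happen in Pig or Hive loaders over HDFS files rather than in a standalone script.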
33. Central Data Repository
Financial data is messy due to many interacting systems
Personal data is obfuscated for security and records get out of sync
Trades need to be “sessionized” into accounts and products
Discrepancies are difficult to reconcile, need to track corrections
Hadoop is a centralized platform for data collection
Single source for data, processing happens on the platform
Metadata used to track information lifecycle
Workflows run and monitor data transformation pipelines
Data served via APIs or in Batch
Single version of the truth, data processed and cleansed centrally
Clear audit trail of data dependencies and usage
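The "sessionize" step above can be sketched as grouping time-ordered trade events by account and product; the field names are assumed for illustration:

```python
from collections import defaultdict

def sessionize(trades):
    # Group raw trade events into per-(account, product) sessions,
    # ordered by timestamp, so corrections can be reconciled in order.
    sessions = defaultdict(list)
    for trade in sorted(trades, key=lambda t: t["ts"]):
        sessions[(trade["account"], trade["product"])].append(trade)
    return dict(sessions)

trades = [
    {"ts": 3, "account": "A1", "product": "FX", "qty": 10},
    {"ts": 1, "account": "A1", "product": "FX", "qty": 5},
    {"ts": 2, "account": "A2", "product": "EQ", "qty": 7},
]
sessions = sessionize(trades)
```

In MapReduce terms this is a shuffle on the (account, product) key with a secondary sort on timestamp.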
38. Genomics
Cost of DNA Sequencing Falling Very Fast
Raw data needs to be aligned and matched
Scientists want to collect and analyze these sequences
Hadoop Can Read Native Format
hadoop-bam – Java library for manipulating Binary Alignment/Map (BAM) data
Alignment, SNP discovery, genotyping
Genomic Tools Based On Hadoop
SEAL – distributed short read alignment
BlastReduce – parallel read mapping
Crossbow – whole genome re-sequencing analysis
CloudBurst – sensitive MapReduce alignment
39. Utilities and the Power Grid
Power grid is aging and maintained incrementally
Failures hard to predict and can have cascading effects
Looking at vibration of transformers over time to find patterns
Predicting failure of grid equipment
Supervised learning to scan time series data for fuzzy patterns
Identify likely faulting equipment for targeted replacement
Hadoop based tools to model equipment behavior
openPDC project: http://openpdc.codeplex.com
Lumberyard – indexing time-series data for low-latency fuzzy queries
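The fuzzy pattern scan over time-series data can be sketched as sliding-window feature extraction with a threshold. In practice the threshold would be fit by supervised learning over labeled failure histories; the limit here is a hand-set assumption:

```python
def window_features(series, width):
    # Emit (mean, variance) for each sliding window of samples.
    feats = []
    for i in range(len(series) - width + 1):
        window = series[i:i + width]
        mean = sum(window) / width
        var = sum((x - mean) ** 2 for x in window) / width
        feats.append((mean, var))
    return feats

def flag_anomalies(series, width=4, var_limit=1.0):
    # Flag window start indices whose variance exceeds the limit;
    # var_limit stands in for a threshold learned from labeled data.
    return [i for i, (_, var) in enumerate(window_features(series, width))
            if var > var_limit]

# Toy vibration trace with one spike.
vibration = [0.1, 0.1, 0.2, 0.1, 0.1, 3.0, 0.1, 0.2]
flags = flag_anomalies(vibration)
```

Each flagged index marks a window touching the spike, i.e. a candidate for targeted equipment inspection.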
40. Smart Meter Example Workflow
Looking at usage patterns in home smart meter data
How to educate consumers to save energy
Capacity planning for the grid
Individual analysis is critical
Personalized reporting to consumers
Predictive modeling of peak usage and potential cost savings
Hadoop for collection, reporting and analysis
Collect time series samples in Hadoop
Partition at various granularities and roll up reports and models
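The partition-and-roll-up step might look like bucketing time-series samples at a chosen granularity; hour-indexed (timestamp, kWh) tuples are an assumption for illustration:

```python
from collections import defaultdict

def roll_up(samples, hours_per_bucket):
    # Aggregate (hour_index, kwh) samples into coarser buckets; with
    # 24 hours per bucket, each bucket is a daily total.
    buckets = defaultdict(float)
    for hour, kwh in samples:
        buckets[hour // hours_per_bucket] += kwh
    return dict(buckets)

# Hypothetical hourly meter samples spanning two days.
samples = [(0, 1.2), (5, 0.8), (23, 2.0), (24, 1.5), (30, 0.5)]
daily = roll_up(samples, 24)
```

In Hadoop each bucket key would become a partition, so the same data can be rolled up at several granularities for reports and models.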
43. Processing Seismic Data
Optimize the IO-intensive phases of seismic processing
Incorporate additional parallelism where it makes sense
Simplify gather/transpose operations with MapReduce
Seismic Unix for Core Algorithms
Well-known, used at many grad programs in geophysics
SU file format can be easily transformed for processing on HDFS
Hadoop Streaming
Seismic Unix, SEPlib, Javaseis - non-Java code in MR
Framework is aware of parameter files needed by SU commands
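The Hadoop Streaming pattern of piping each record through a non-Java executable can be sketched in Python; a portable stand-in command replaces an actual Seismic Unix filter here:

```python
import subprocess
import sys

def streaming_mapper(records, command):
    # Pipe each input record through an external command and collect
    # its stdout, the way Hadoop Streaming drives non-Java mappers.
    out = []
    for rec in records:
        proc = subprocess.run(command, input=rec, capture_output=True,
                              text=True, check=True)
        out.append(proc.stdout.strip())
    return out

# Portable stand-in for an SU command: uppercase via a subprocess.
upcase = [sys.executable, "-c",
          "import sys; sys.stdout.write(sys.stdin.read().upper())"]
mapped = streaming_mapper(["trace header a", "trace header b"], upcase)
```

A real Streaming job launches the external process once per task and streams all records through its stdin, rather than spawning per record as this sketch does.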
Copyright 2011 Cloudera Inc. All rights reserved
46. Brands and Sentiment Analysis
Internet generates a lot of chatter about brands
Understanding what’s being said is crucial to protecting brand value
Facebook, Twitter generate a lot of data for a global top brand
Capturing and Processing direct feedback
Better engagement and alerting via Sentiment Analysis
Not yet ready for fully automated customer service
Hadoop handles the diverse data types and processing
Sources of data changing and semantics continuously evolving
Sophistication of algorithms is improving daily
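A first-cut sentiment score for brand mentions can be as simple as a lexicon count; the word lists below are illustrative, and production systems use trained models:

```python
# Illustrative word lists; real systems learn these from data.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "broken", "awful"}

def score_mention(text):
    # Positive minus negative word count for one brand mention.
    words = text.lower().split()
    return (sum(w in POSITIVE for w in words)
            - sum(w in NEGATIVE for w in words))

mentions = ["I love this brand", "Customer service is awful and broken"]
scores = [score_mention(m) for m in mentions]
```

Scores like these can drive alerting and engagement prioritization even before fully automated responses are feasible.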
47. Point of Sale Transaction Analysis
Lots of machine-generated data available
Line items, stock, coupons, ads
Stored in various formats
Pattern recognition enables constant reassessment
Optimizing across multiple data sources
Demand prediction based on joining multiple data sets for more insight
Retail Supply Chain
Weather and Financial data
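Joining POS data with an external source for demand prediction can be sketched as a date-keyed join; the field names and toy weather table are assumptions:

```python
def join_sales_weather(sales, weather):
    # Date-keyed join of daily POS totals with weather observations,
    # the kind of combined input a demand model would consume.
    return [{"date": d, "units": units, "temp_c": weather[d]}
            for d, units in sales.items() if d in weather]

# Toy inputs; a real pipeline would express this join in Pig or Hive.
sales = {"2010-07-01": 120, "2010-07-02": 95}
weather = {"2010-07-01": 31, "2010-07-03": 22}
rows = join_sales_weather(sales, weather)
```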
56. Recommendations and Forecasting
Collect and serve personalization information
Wide variety of constantly changing data sources
Data guaranteed to be messy
Data ingestion includes collection of raw data
Filtering and fixing of poorly formatted data
Normalization and matching across data sources
Analysis looks for reliable attributes and groupings
Interpretation (e.g. gender by name)
Aggregation across likely matching identifiers
Identify possible predicted attributes or preferences
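The interpretation and aggregation steps above can be sketched together: infer an attribute from a reliable one, then merge records across likely-matching identifiers. The name-to-gender lookup and email matching are hypothetical simplifications:

```python
# Hypothetical lookup for the "gender by name" interpretation step.
NAME_HINTS = {"alice": "F", "bob": "M"}

def interpret(record):
    # Fill in a predicted attribute from a reliable one.
    record = dict(record)
    parts = record.get("name", "").split()
    first = parts[0].lower() if parts else ""
    record["gender"] = NAME_HINTS.get(first, "unknown")
    return record

def aggregate(records):
    # Merge records across likely-matching identifiers (email here),
    # skipping empty or unknown values so real data wins.
    merged = {}
    for rec in map(interpret, records):
        slot = merged.setdefault(rec.get("email", "").lower(), {})
        for field, value in rec.items():
            if value not in ("", "unknown"):
                slot[field] = value
    return merged

records = [
    {"name": "Alice Smith", "email": "A@example.com"},
    {"email": "a@example.com", "pref": "sci-fi"},
]
profiles = aggregate(records)
```

Quality checks on the merged profiles would follow, since both the interpretation and the matching can be wrong.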
59. Why invest in training?
• Maximize your investment in a new technology
• Make fewer mistakes by learning the best practices
• Cheaper and easier to cross-train than hire
– Existing DBAs, Analysts and System Administrators can become Hadoop users
61. Value proposition of private training
• 12k/day for up to 20 students
– NEW: 8k/day for up to 10 students
– Price includes courseware, lab materials, cert vouchers (for Dev, Admin, HBase), and T&E
• We can tailor a class
– We have ~3 weeks of content that we can mix and match into a customized class
– Saves the customer’s time by covering the most relevant topics, cutting out non-essential material
• Customer chooses location and date
• We’re under NDA
Customers experience many pain points when leveraging this architecture for Big Data. Here are 3 of the most common.
Hadoop typically solves two types of problems: advanced analytics and data processing. These go by different terms in different industries. The applicability of these solutions is broad. We’ve successfully deployed Hadoop and helped solve a diverse set of business problems.
FinSvc companies are realizing that they need to understand the fundamental risk in their customer base. All of a bank’s working capital originates with customers. Being able to better predict fluctuations can help them optimize how to put that capital to work.
FinSvc companies need to analyze trades both for regulatory requirements and for internal surveillance and fraud detection (internal and external). To date this primarily involves looking at transactions and sampling data. Hadoop enables access to detailed data and non-transactional data.
FinSvc companies have many data sources and many consumers of data. Multiple data processing paths can lead to discrepancies in data as well as redundancies in work. A central repository manages all inbound data, takes requests for processing and delivers data sets. This makes the data reliable and traceable. FinSvc data is also messy and often needs to be updated or restated. A central location can improve tracing of all the dependent data sets that need to be reprocessed.
Banking is becoming increasingly competitive, very similar to retail. It used to be that you banked with your local credit union for life. Now you have a different 401k with every company, some 529s somewhere, checking, a mortgage, etc. Competitive pressure has driven down fees (despite recent complaints about new fees). Banks now need to compete on what they can offer on top of the ubiquitous financial products. Enter personalized asset management: merging financial models of market trends with personalized portfolios and goals. The work is embarrassingly parallel and can be offered self-service or via a salesperson.
Assessing actual risk exposure in investments is incredibly complex. Multi-tiered instruments have lots of variables. Trends that cross the instruments have complex relationships. This is all well-structured data with intricate and fluid relationships. Add that trade volumes have skyrocketed, and this clearly becomes a Hadoop problem.
There are regulatory requirements for trade analytics (e.g. RegNMS) that need to be audited. The margins on trades can be razor thin, and there’s value in analyzing trade performance. Trade execution platforms and algorithms are incredibly complicated. This is time-series data, which looks a lot like clickstream data. Tracing particular trades through systems (in effect sessionizing them) and comparing them to performance metrics is a classic Hadoop problem.
There’s a yearly revolution in life sciences every time the cost of sequencing falls and the throughput doubles. The existing HPC systems can’t keep up with the amount of data. Hadoop allows scientists to combine data and processing into one scale-out grid. There are already numerous libraries available to tackle these problems.
A big challenge in our electrical grid is that the infrastructure has grown incrementally over the past 100 years. We can’t wholesale replace it, both because of cost and risk. In order to prevent brownouts and blackouts caused by component failure, the TVA (responsible for the east coast electrical grid) is analyzing for patterns that can predict likely failure. This uses a combination of supervised learning and time-series indexing to detect and analyze how components are behaving.
Smart meters are opening up a whole new world of data about how people consume electricity (vs. how it’s delivered). There are two particular focuses initially: one is to turn this data into education to help consumers be smarter about their electrical use. The other is to help in better capacity planning.
An area you might not consider as being on the cutting edge of technology is biodiversity indexing. One of the advantages of Hadoop is that it can store any kind of data in any format. It gives you the ability to cleanse that data repeatedly and turn it into well-defined structured data. If you need to adjust how you tackle that data, it’s always available in raw form. The final results can be served out of a traditional database or HBase.
We rely today on networks as much as we rely on electricity. This puts a heavy strain on the underlying network infrastructure. Closely monitoring those networks results in a flood of data (the largest network we’re aware of collects several hundred TB/day). Much of the monitoring is data exhaust: not fundamentally required to operate the network but highly indicative of how it is functioning.
Seismic readings generate massive data volumes when mapping out the topology of the planet. These are typically collected on large storage farms, keeping only sampled or aggregated measurements. Then they’re transferred to HPC grids to perform the complex model definition. Hadoop opens the door to using standard, well-known libraries in parallel and running them on the same grid that is storing the data. This reduces the need for sampling and significantly speeds up processing.
Companies have been able to analyze customer churn based on when other customers are leaving. Hadoop for the first time helps them capture the behaviors leading up to customer loss to help predict when these events are likely. This gives companies more time to respond to possible customer loss. It involves traversing the social graph (customers rarely leave one at a time) and identifying and recognizing patterns that are leading indicators.
Much of the discussion about brands today happens in social media. This not only impacts a company’s perception but can have a direct influence on relationships with customers and the ability to sell. Hadoop is a natural solution for gathering and contextualizing discussions about company brands and products.
Point-of-sale analysis includes many different types of data today, from standard POS data to online, coupon-based and mixed. Companies need to track data from many different sources in different formats to understand their sales in depth. Hadoop can be used to better understand the supply chain or to incorporate external data to explain sales behaviors.
It used to be that prices were set varying by region or season and updated periodically. Today pricing can be completely dynamic, especially for online retailers. And consumers are able to comparison shop with a few keystrokes. Customers also weigh the value of their purchase against time to delivery. Taking all these behaviors into account in a hyper-competitive market is complex. Hadoop is being used to tackle these challenges, and new techniques are being applied to understand correlations, the effects of bundles and incentive discounts, and to cluster customers by a variety of attributes, not just as one type of consumer or another.
Customer loyalty used to be taken for granted. The programs were designed to help track customer purchases with finer granularity. Today customer loyalty is being used to bridge the gap between purchases. When customers can easily comparison shop, it’s not clear what the incentives are to stay with the same vendor. Loyalty programs are being designed not just to track or encourage customers to shop but to build a relationship with the customer, so that the next time they shop, they prefer the brand that has been thinking of them and their needs. Loyalty programs can also be used to make timely offers; for example, when a customer is expected to run out of a particular product, provide a coupon that offers an upsell.
The Internet has expanded the world of offers from candy and magazines while you wait in the checkout line to anywhere and everywhere. Using modern ad networks, companies can track their customers after they’ve left their site. This opens up possibilities to re-capture customers who have not yet bought, or to cross-sell and upsell even after the transaction is complete. Customers use technologies such as HBase to incrementally monitor where customers are going. Algorithms can then be run on incremental data at a variety of time scales.
An online media group within a larger brand-name company has multiple separately branded and operated sites. Each has different systems for logs, including ad logs and ops logs, and different techniques for processing them. Hadoop provides a centralized platform for all of these properties to collect their system logs, ad logs and ops logs. Hadoop is also loaded with website feeds from 3rd-party providers and operational metrics. This creates a standard platform for analytics and reporting. They’re soon turning on exploratory access and will provide centralized storage services for all properties.
A mobile ad platform measures standard metrics, but most of the data is arbitrary text since it can be defined by 3rd-party developers. There are multiple SLAs for reporting to advertisers as well as for data accuracy. Log data is collected into HDFS and prepped, then loaded into HBase. HBase is used to serve results to advertisers in a similar fashion to general-purpose online analytics services.
An online gaming vendor has multiple silos for each user interaction (registration, payments, game play, web interaction). The most popular games are very dynamic (simulating real-world sports). The first goal is to grant multiple business units access to all of the data. In particular, the game play metrics (telemetry data) are extremely detailed, similar to sensor data. The second goal is exploratory analysis, for example looking at distributions in game play behavior or for event triggers. A lot of the initial analysis is basic count-distinct over a wide variety of attributes and combinations of attributes to look for correlated behaviors. Hadoop is also used to compute online statistics such as leaderboards.
Search quality is measured by the user’s ability not only to find what they want but to complete the transaction or take a next step. Understanding users’ goals is very difficult, and search trends vary over time. Fundamentally improving the service and assessing quality means logging everything into HDFS and rolling up your sleeves. This customer uses Hive mostly for aggregation and Sqoops the results into an RDBMS to publish to end users. Analytics have now become a critical part of the service (e.g. generating predictive search). Now they are focusing on where analytic needs are growing and what new data about searches the business wants to see.
Recommendation engines are popular applications on Hadoop. There are a wide variety of constantly changing sources, and the data is always messy. At data ingestion this requires filtering and fixing of poorly formatted data. These processes are constantly changing as the data changes. Data is then normalized and matched across data sources. In some cases this means interpreting and filling in fields; in other cases it involves aggregation across fuzzy-matched identifiers. These also require quality checks.
Measuring influence on the internet involves collecting a firehose of data that includes opinions, references and links. Think of this as a very messy and very dynamic PageRank, but you’re ranking people and brands. Hadoop is used to prep all the data: identify metadata and distinct topics (which change). Hadoop is also used to score the social graph and filter out bots and spam. This is all tied together with Pig and Java and coordinated with Oozie. Data is then batch-served in CSV and loaded into HBase to back an API.
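The "messy PageRank over people and brands" can be sketched as power iteration over a reference graph. In the real pipeline each iteration would be a MapReduce pass coordinated by Oozie; the damping factor and toy graph below are assumptions:

```python
def influence_scores(graph, iterations=50, damping=0.85):
    # Power iteration over a reference graph, PageRank-style; each
    # iteration corresponds to one MapReduce pass in the pipeline.
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, outs in graph.items():
            for dst in outs:
                nxt[dst] += damping * rank[src] / len(outs)
        rank = nxt  # dangling-node mass is dropped in this sketch
    return rank

# Toy graph: "b" is referenced by both "a" and "c".
graph = {"a": ["b"], "b": ["a"], "c": ["b"]}
scores = influence_scores(graph)
```

Bot and spam filtering would prune edges from the graph before scoring, since fake references inflate ranks directly.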