 hadoop 101 aug 21 2012 tohug
Presentation delivered for Aug 21 2012 TOHUG.

Notes
  • Customers experience many pain points when leveraging this architecture for Big Data. Here are three of the most common.
  • Hadoop typically solves two types of problems: advanced analytics and data processing. These go by different terms in different industries. The applicability of these solutions is broad. We’ve successfully deployed Hadoop and helped solve a diverse set of business problems.
  • FinSvc companies are realizing that they need to understand the fundamental risk in their customer base. All of a bank’s working capital originates with customers. Being able to better predict fluctuations can help them optimize how to put that capital to work.
  • FinSvc companies need to analyze trades both for regulatory requirements and for internal surveillance and fraud detection (internal and external). To date this has primarily involved looking at transactions and sampling data. Hadoop enables access to detailed data and non-transactional data.
  • FinSvc companies have many data sources and many consumers of data. Multiple data processing paths can lead to discrepancies in data as well as redundancies in work. A central repository manages all inbound data, takes requests for processing and delivers data sets. This makes the data reliable and traceable. Also, FinSvc data is messy and often needs to be updated or restated. A central location can improve tracing of all the dependent data sets that need to be reprocessed.
  • Banking is becoming increasingly competitive, very similar to retail. It used to be that you banked with your local credit union for life. Now you have a different 401(k) with every employer, some 529s somewhere, checking, a mortgage, etc. Competitive pressure has driven down fees (despite recent complaints about new fees). Banks now need to compete on what they can offer on top of the ubiquitous financial products. Enter personalized asset management – merge financial models of market trends with personalized portfolios and goals. Embarrassingly parallel, it can be offered self-service or via a sales person.
  • Assessing actual risk exposure in investments is incredibly complex. Multi-tiered instruments have lots of variables. Trends that cross the instruments have complex relationships. This is all well-structured data with intricate and fluid relationships. Add that trade volumes have skyrocketed and this clearly becomes a Hadoop problem.
  • There are regulatory requirements for trade analytics (e.g. Reg NMS) that need to be audited. The margins on trades can be razor thin and there’s value in analyzing trade performance. Trade execution platforms and algorithms are incredibly complicated. This is time-series data, which looks a lot like clickstream data. Tracing particular trades through systems – in effect sessionizing them – and comparing to performance metrics is a classic Hadoop problem.
  • There’s a yearly revolution in life sciences every time the cost of sequencing falls and the throughput doubles. The existing HPC systems can’t keep up with the amount of data. Hadoop allows scientists to combine data and processing into one scale-out grid. There are already numerous libraries available to tackle these problems.
  • A big challenge in our electrical grid is that the infrastructure has grown incrementally over the past 100 years. We can’t wholesale replace it – both because of cost and risk. In order to prevent brownouts and blackouts caused by component failure, the TVA (the Tennessee Valley Authority, which operates a large portion of the southeastern US grid) is analyzing for patterns that can predict likely failure. This uses a combination of supervised learning and time-series indexing to detect and analyze how components are behaving.
  • Smart meters are opening up a whole new world of data about how people consume electricity (vs. how it’s delivered). There are two particular focuses initially – one is to turn this data into education to help consumers be smarter about their electrical use. The other is to help in better capacity planning.
  • An area you might not consider as being on the cutting edge of technology is biodiversity indexing. One of the advantages of Hadoop is that it can store any kind of data in any format. It gives you the ability to cleanse that data repeatedly and turn it into well-defined structured data. If you need to adjust how you tackle that data, it’s always available in raw form. The final results can be served out of a traditional database or HBase.
  • We rely today on networks as much as we rely on electricity. This puts a heavy strain on the underlying network infrastructure. Closely monitoring those networks results in a flood of data (the largest network we’re aware of collects several hundred TB/day). Much of the monitoring is data exhaust – not fundamentally required to operate the network but highly indicative of how it is functioning.
  • Seismic readings generate massive data volumes when mapping out the topology of the planet. These are typically collected on large storage farms keeping only sampled or aggregated measurements. Then they’re transferred to HPC grids to perform the complex model definition. Hadoop opens the door to using standard, well-known libraries in parallel and running them on the same grid that is storing the data. This reduces the need for sampling and significantly speeds up processing.
  • Companies have been able to analyze customer churn based on when other customers are leaving. Hadoop for the first time helps them capture behaviors leading up to customer loss, to help predict when these events are likely. This gives companies more time to respond to possible customer loss. This involves traversing the social graph (customers rarely leave one at a time) and identifying and recognizing patterns that are leading indicators.
  • Much of the discussion about brands today happens in social media. This not only impacts the company’s perception but can have a direct influence on relationships with customers and the ability to sell. Hadoop is a natural solution for gathering and contextualizing discussions about company brands and products.
  • Point-of-sale analysis includes many different types of data today, from standard POS data to online, coupon-based and mixed. Companies need to track data from many different sources in different formats to understand their sales in depth. Hadoop can be used to better understand the supply chain or to incorporate external data to explain sales behaviors.
  • It used to be that prices were set varying by region or season and updated periodically. Today pricing can be completely dynamic – especially for online retailers. And consumers are able to comparison shop with a few keystrokes. Customers also weigh the value of their purchase against time to delivery. Taking all these behaviors into account in a hyper-competitive market is complex. Hadoop is being used to tackle these challenges, and new techniques are being applied to understand correlations, effects of bundles and incentive discounts, and to cluster customers by a variety of attributes, not just as one type of consumer or another.
  • Customer loyalty used to be taken for granted. The programs were designed to help track customer purchases with finer granularity. Today customer loyalty is being used to bridge the gap between purchases. When customers can easily comparison shop, the incentive to stay with the same vendor is not clear. Loyalty programs are being designed not just to track or encourage customers to shop but to build a relationship with the customer, so that the next time they shop, they prefer the brand that has been thinking of them and their needs. Loyalty programs can also be used to make timely offers – for example, when a customer is expected to run out of a particular product, provide a coupon that offers an upsell.
  • The Internet has expanded the world of offers from candy and magazines while you wait in the checkout line to anywhere and everywhere. Using modern ad networks, companies can track their customers after they’ve left their site. This opens up possibilities to re-capture customers who have not yet bought, or to cross-sell and upsell even after the transaction is complete. Companies use technologies such as HBase to incrementally monitor where customers are going. Algorithms can then be run on incremental data at a variety of time scales.
  • An online media group within a larger brand-name company has multiple separately branded and operated sites. Each has different systems for logs, including ad logs and ops logs, and different techniques for processing them. Hadoop provides a centralized platform for all of these properties to collect their system logs, ad logs and ops logs. Hadoop is also loaded with website feeds from 3rd-party providers and operational metrics. This creates a standard platform for analytics and reporting. They’re soon turning on exploratory access and will provide centralized storage services for all properties.
  • A mobile ad platform measures standard metrics, but most of the data is arbitrary text since it can be defined by 3rd-party developers. There are multiple SLAs for reporting to advertisers as well as for data accuracy. Log data is collected into HDFS and prepped, then loaded into HBase. HBase is used to serve results to advertisers in a similar fashion to general-purpose online analytics services.
  • An online gaming vendor has multiple silos for each user interaction (registration, payments, game play, web interaction). The most popular games are very dynamic (simulating real-world sports). The first goal is to grant multiple businesses access to all of the data. In particular, the game play metrics (telemetry data) are extremely detailed, similar to sensor data. The second goal is exploratory analysis – for example, looking at distributions in game play behavior or for event triggers. A lot of the initial analysis is basic count-distinct on a wide variety of attributes and combinations of attributes, to look for correlated behaviors. Hadoop is also used to compute online statistics such as leaderboards.
  • Search quality is measured by the user’s ability not only to find what they want but to complete the transaction or take a next step. Understanding the user’s goals is very difficult, and search trends vary over time. Fundamentally improving the service and assessing quality means logging everything into HDFS and rolling up your sleeves. This customer uses Hive mostly for aggregation and sqoops the results into an RDBMS to publish to end users. Analytics have now become a critical part of the service (e.g. generating predictive search). Now they are focusing on where analytic needs are growing and what new data about searches the business wants to see.
  • Recommendation engines are popular applications on Hadoop. There are a wide variety of constantly changing sources, and the data is always messy. At data ingestion this requires filtering and fixing of poorly formatted data. These processes are constantly changing as the data changes. Data is then normalized and matched across data sources. In some cases this means interpretation and filling in fields; in other cases it involves aggregation across fuzzy-matched identifiers. These also require quality checks.
  • Measuring influence on the internet involves collecting a fire hose of data that includes opinions, references and links. Think of this as a very messy and very dynamic PageRank, but you’re ranking people and brands. Hadoop is used to prep all the data – identify metadata and distinct topics (which change). Hadoop is also used to score the social graph and filter out bots and spam. This is all tied together with Pig and Java and coordinated with Oozie. Data is then batch-served in CSV and loaded into HBase to back an API.

Transcript

  • 1. August 21 2012 – Toronto Hadoop User Group, a.k.a. THUGs. Introduction to Hadoop: Pretty Picture Version. {Due credit to Todd’s Magic}
  • 2. Why are we here? • Become exposed to the core concepts of Hadoop • Understand the projects within Hadoop and how they fit together • Review common use cases for Hadoop • Share beginner experiences with Hadoop • Ask a @$%$#-load of questions about Hadoop ©2011 Cloudera, Inc. All Rights Reserved.
  • 3. What I won’t be able to give you… • A complete introduction to the technology (takes too long) • Enough information to begin development or implementation of Hadoop (too complicated) • Enough information to install and configure Hadoop (I recommend you start with the Cloudera VMware image individually, or Cloudera Manager for a real cluster) • A hands-on Pig-fest or Hive-fest (that’s a THUG meetup to come…) ©2011 Cloudera, Inc. All Rights Reserved.
  • 4. Users of Cloudera: Financial, Retail & Consumer, Web, Telecom, Media. ©2011 Cloudera, Inc. All Rights Reserved.
  • 5. Hadoop Use Cases (industry – advanced-analytics use case / data-processing use case): Web – Social Network Analysis / Clickstream Sessionization; Media – Content Optimization / Clickstream Sessionization; Telco – Network Analytics / Mediation; Retail – Loyalty & Promotions Analysis / Data Factory; Financial – Fraud Analysis / Trade Reconciliation; Federal – Entity Analysis / SIGINT; Bioinformatics – Sequencing Analysis / Genome Mapping. ©2011 Cloudera, Inc. All Rights Reserved.
  • 6. CDH components – File system mount: FUSE-DFS; UI framework: Hue; SDK: Hue SDK; Workflow and scheduling: Apache Oozie; Metadata: Apache Hive; Query/analytics: Apache Pig, Apache Hive, Apache Mahout; Data integration: Apache Flume, Apache Sqoop; Fast read/write access: Apache HBase; Storage and processing: HDFS, MapReduce; Coordination: Apache ZooKeeper. ©2012 Cloudera, Inc. | Company confidential
  • 7. Typical Data Pipeline: Data Sources → Processing Layer (temporary storage) → Data Warehouse → Marts, plus an Archive. ©2011 Cloudera, Inc. All Rights Reserved.
  • 8. Typical Data Pipeline with Hadoop: data sources feed HDFS via Flume and Sqoop; Oozie coordinates Pig, Hive and MapReduce jobs over the original source data; results or calculated data are sqooped to the data warehouse and marts. ©2011 Cloudera, Inc. All Rights Reserved.
  • 9. Several advantages • Store more data, cheaply • Use commodity hardware • Scale linearly, predictably • Tolerate hardware failure • Turn data into a strategic asset – ad hoc analytics, predictive analytics ©2012 Cloudera, Inc. | Company confidential
  • 10. Several more advantages • Get a long-term view of data • Add unstructured, semi-structured data • Change schema on the fly (late binding) • Integrate with existing infrastructure ©2012 Cloudera, Inc. | Company confidential
  • 11. HDFS: self-healing, high bandwidth. HDFS breaks incoming files into blocks and stores them redundantly across the cluster. (Diagram: a file split into blocks 1–5, each replicated on three datanodes.) ©2012 Cloudera, Inc. All Rights Reserved.
  • 12. HDFS (continued): the same cluster after a datanode failure – the remaining replicas keep every block available. ©2012 Cloudera, Inc. All Rights Reserved.
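The block-and-replica scheme on these two slides can be sketched in a few lines. This is a toy model only: real HDFS uses rack-aware placement (not round-robin) and the 64 MB default block size shown here is the Hadoop 1.x-era value.

```python
# Toy model of HDFS block splitting and replica placement (illustrative only).
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default in Hadoop 1.x
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes becomes."""
    full, rem = divmod(file_size, block_size)
    return [block_size] * full + ([rem] if rem else [])

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)            # a 200 MB file
print(len(blocks))                                       # 4 blocks: 64+64+64+8 MB
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Because every block lives on three nodes, losing any single node (slide 12) leaves at least two copies of each block intact.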
  • 13. MapReduce: Map • Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs, e.g. (filename, line). • map() produces one or more intermediate values along with an output key from the input. (Diagram: map tasks emit (key, intermediate value) pairs that pass through the shuffle phase to reduce tasks, which emit final (key, value) pairs.) ©2011 Cloudera, Inc. All Rights Reserved.
  • 14. MapReduce: Reduce • After the map phase is over, all the intermediate values for a given output key are combined together into a list. • reduce() combines those intermediate values into one or more final values for that same output key. (Same map/shuffle/reduce diagram as the previous slide.) ©2011 Cloudera, Inc. All Rights Reserved.
  • 15. MapReduce: Execution ©2011 Cloudera, Inc. All Rights Reserved.
  • 16. MapReduce: WordCount. Input text: “The cat sat on the mat. The aardvark sat on the sofa.” Mapping emits (word, 1) for each word; shuffling groups the pairs by word, e.g. on → [1, 1], sat → [1, 1], the → [1, 1, 1, 1]; reducing sums each list. Final result: aardvark 1, cat 1, mat 1, on 2, sat 2, sofa 1, the 4. ©2011 Cloudera, Inc. All Rights Reserved.
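The WordCount flow on this slide can be simulated in-process: map emits (word, 1), a shuffle step groups values by key, and reduce sums them. This is a single-machine sketch of the pattern, not Hadoop's actual distributed execution.

```python
from collections import defaultdict

# Minimal in-process sketch of MapReduce WordCount:
# map emits (word, 1), shuffle groups values by key, reduce sums them.
def map_phase(lines):
    for line in lines:
        for word in line.lower().replace(".", "").split():
            yield (word, 1)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

lines = ["The cat sat on the mat.", "The aardvark sat on the sofa."]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 4, matching the slide's final result
```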
  • 17. Sqoop: RDBMS to HDFS ©2011 Cloudera, Inc. All Rights Reserved.
  • 18. Sqoop: HDFS to RDBMS ©2011 Cloudera, Inc. All Rights Reserved.
  • 19. FlumeNG: High-level Architecture. Clients send events to agents; each agent wires a source to one or more channels, each drained by a sink. Example sources: Avro, netcat, exec. Example channels: memory, JDBC. Example sinks: HDFS, Avro. ©2011 Cloudera, Inc. All Rights Reserved.
  • 20. HBase: Table Structure. Rows are keyed by reversed domain (e.g. com.cloudera.www, com.foo.www); column families such as “contents” and “anchor_text” hold timestamped cells, e.g. row com.cloudera.www has a contents cell (1273746289103, “<html>…”) and an anchor_text cell under column key Baz.org (1273871962874, “Hadoop!…”). A row can keep multiple timestamped versions of a cell (e.g. com.foo.www at 1273698729045 and 1273699734191). ©2011 Cloudera, Inc. All Rights Reserved.
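The table structure above — cells addressed by (row key, column family, qualifier) and versioned by timestamp — can be modeled with a plain dictionary. This is a toy sketch of the data model only, not the HBase client API, and the sample row/values echo the slide.

```python
import bisect

# Toy model of HBase's logical view: a map of
# (row key, column family, column qualifier) -> timestamped versions.
class ToyTable:
    def __init__(self):
        self.cells = {}  # (row, family, qualifier) -> sorted list of (ts, value)

    def put(self, row, family, qualifier, ts, value):
        versions = self.cells.setdefault((row, family, qualifier), [])
        bisect.insort(versions, (ts, value))  # keep versions ordered by timestamp

    def get(self, row, family, qualifier):
        """Return the newest version of the cell, as an HBase get would by default."""
        versions = self.cells.get((row, family, qualifier))
        return versions[-1][1] if versions else None

t = ToyTable()
t.put("com.cloudera.www", "anchor_text", "baz.org", 1273871962874, "Hadoop!")
t.put("com.cloudera.www", "anchor_text", "baz.org", 1273999999999, "Hadoop v2!")
print(t.get("com.cloudera.www", "anchor_text", "baz.org"))  # newest: "Hadoop v2!"
```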
  • 21. HBase: Architecture ©2011 Cloudera, Inc. All Rights Reserved.
  • 22. Hive: SQL-based data warehousing application • Language is SQL-like • Features for analyzing very large data sets • Partition columns, sampling, buckets. Example: SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5; ©2011 Cloudera, Inc. All Rights Reserved.
  • 23. Pig: data-flow oriented language – “Pig Latin” • Datatypes include sets, associative arrays, tuples • High-level language for routing data; allows easy integration of Java for complex tasks. Example: emps = LOAD 'people.txt' AS (id, name, salary); rich = FILTER emps BY salary > 200000; sorted_rich = ORDER rich BY salary DESC; STORE sorted_rich INTO 'rich_people.txt'; ©2011 Cloudera, Inc. All Rights Reserved.
  • 24.–25. Oozie: workflow/coordination service to manage data processing jobs for Hadoop ©2011 Cloudera, Inc. All Rights Reserved.
  • 26. Hadoop Security • Authentication is secured by Kerberos v5 and integrated with LDAP • The Hadoop server can ensure that users and groups are who they say they are • Job control includes access control lists, which means jobs can specify who can view logs, counters and configurations, and who can modify a job • Tasks now run as the user who launched the job ©2011 Cloudera, Inc. All Rights Reserved.
  • 27. Typical Use Cases ©2011 Cloudera, Inc. All Rights Reserved.
  • 28. Common Challenges: 1. Network analysis and sessionization 2. Content optimization and engagement modeling 3. Usage analysis and mediation 4. Entity surveillance and signal monitoring 5. Recommendations and modeling 6. Loyalty, promotion analysis and targeting 7. Fraud analysis, reconciliation and risk 8. Time-series analysis, mapping and modeling ©2011 Cloudera, Inc. All Rights Reserved.
  • 29. What Can Hadoop Do For You? Two core use cases – advanced analytics and data processing – applied across verticals: Web – Social Network Analysis / Clickstream Sessionization; Media – Content Optimization / Engagement; Telco – Network Analytics / Mediation; Retail – Loyalty & Promotions Analysis / Data Factory; Financial – Fraud Analysis / Trade Reconciliation; Federal – Entity Analysis / SIGINT; Bioinformatics – Sequencing Analysis / Genome Mapping. ©2011 Cloudera, Inc. All Rights Reserved.
  • 30. Financial Services: 1. Customer risk analysis 2. Surveillance and fraud detection 3. Central data repository 4. Personalization and asset management 5. Market risk modeling 6. Trade performance analytics ©2011 Cloudera, Inc. All Rights Reserved.
  • 31. Customer Risk Analysis • Build a comprehensive data picture of customer-side risk: publish a consolidated set of attributes for analysis; map ratings across products • Parse and aggregate data from different sources: credit and debit cards, product payments, deposits and savings; banking activity, browsing behavior, call logs, e-mails and chats • Merge data into a single view: a “fuzzy join” among data sources; structure and normalize attributes; sentiment analysis, pattern recognition Copyright 2010 Cloudera Inc. All rights reserved
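The "fuzzy join" on this slide — linking records whose keys are similar but not identical — can be sketched with the standard library. The field names, sample records and the 0.85 similarity threshold are illustrative assumptions, not the deck's actual method.

```python
from difflib import SequenceMatcher

# Sketch of a "fuzzy join": link records from two sources whose customer
# names are similar but not identical. Threshold and fields are assumptions.
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_join(left, right, key, threshold=0.85):
    matches = []
    for l in left:
        best = max(right, key=lambda r: similarity(l[key], r[key]))
        if similarity(l[key], best[key]) >= threshold:
            matches.append((l, best))
    return matches

cards = [{"name": "Jonathan Q. Smith", "balance": 1200}]
deposits = [{"name": "Jonathon Q Smith", "savings": 9000},
            {"name": "Alice Liddell", "savings": 50}]
print(fuzzy_join(cards, deposits, "name"))  # links the two Smith records
```

In a real Hadoop pipeline the same idea would run as a MapReduce or Pig join after blocking/normalizing the keys; this loop is O(n·m) and only meant to show the matching step.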
  • 32. Surveillance and Fraud Detection • Trade surveillance records activity in a central repository: centralized logging across all execution platforms; structured and raw log data from multiple applications • Pattern recognition detects anomalies and harmful behavior: the feature set and timeline vector are very dynamic; schema-on-read provides flexibility for analysis • Data is primarily served and processed in HDFS with MapReduce: data filtering and projection in Pig and Hive; statistical modeling of data sets in R or SAS Copyright 2010 Cloudera Inc. All rights reserved
  • 33. Central Data Repository • Financial data is messy due to many interacting systems: personal data is obfuscated for security and records get out of sync; trades need to be “sessionized” into accounts and products; discrepancies are difficult to reconcile and corrections must be tracked • Hadoop is a centralized platform for data collection: a single source for data, processing happens on the platform; metadata used to track the information lifecycle; workflows run and monitor data transformation pipelines • Data served via APIs or in batch: a single version of the truth, data processed and cleansed centrally; clear audit trail of data dependencies and usage Copyright 2010 Cloudera Inc. All rights reserved
  • 34. Personalization and Asset Mgmt • Institutional and personal investing services: arms investors with sophisticated models for their positions; success measured by upsell and conversion (as well as profit) • Data analysis across distinct data sources: market data and individual assets by investor; investor strategy, goals and interactive behavior • Data sources combined in HDFS: models written in Pig with UDFs and generated regularly; reports for sales, also fed into an online recommendation system ©2011 Cloudera, Inc. All Rights Reserved.
  • 35. Market Risk Modeling • Evaluating asset risk is very data intensive: trade volumes have increased dramatically; classic indicators at the daily level don’t provide a clear picture; trends across complex instruments can be hard to spot • Models require massive brute-force calculation: multiple models built in batch and in parallel • Data is primarily structured and sourced from RDBMS: transactional data sqooped in to combine with market feeds; resulting predictions sqooped out and served via RDBMS ©2011 Cloudera, Inc. All Rights Reserved.
  • 36. Trade Performance Analytics • Increased demands on trade analytics: regulatory requirements for best-price trading across exchanges; increased competition and scrutiny add a focus on optimization • Trade analytics becomes a clickstream problem: trade execution systems include order trails and execution logs, sessionized across order systems and combined with system logs • Processing, analysis and audit trail all in Hadoop: KPIs summarized as regular reports written in Hive; data available for historical analysis and discovery ©2011 Cloudera, Inc. All Rights Reserved.
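The sessionization step described above — the same treatment applied to clickstreams — can be sketched as gap-based grouping: events for one key are sorted by time and split wherever the gap exceeds a timeout. The 30-minute timeout and event shapes are illustrative assumptions; in practice this runs as a MapReduce or Hive job.

```python
from itertools import groupby
from operator import itemgetter

# Gap-based sessionization sketch: split each key's timeline into
# sessions wherever consecutive events are more than TIMEOUT apart.
TIMEOUT = 30 * 60  # seconds; an illustrative choice

def sessionize(events):
    """events: iterable of (key, epoch_seconds). Returns {key: [session, ...]}."""
    sessions = {}
    for key, group in groupby(sorted(events), key=itemgetter(0)):
        times = [t for _, t in group]
        current, out = [times[0]], []
        for t in times[1:]:
            if t - current[-1] > TIMEOUT:
                out.append(current)   # gap too large: close the session
                current = [t]
            else:
                current.append(t)
        out.append(current)
        sessions[key] = out
    return sessions

events = [("order-42", 0), ("order-42", 600), ("order-42", 7200)]
print(sessionize(events))  # order-42 splits into two sessions
```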
  • 37. Science and Energy: 1. Genomics 2. Utilities and power grid 3. Smart meters 4. Biodiversity indexing 5. Network failures 6. Seismic data ©2011 Cloudera, Inc. All Rights Reserved.
  • 38. Genomics • The cost of DNA sequencing is falling very fast: raw data needs to be aligned and matched; scientists want to collect and analyze these sequences • Hadoop can read the native format: hadoop-bam, a Java library for manipulation of Binary Alignment/Map (BAM) files; alignment, SNP discovery, genotyping • Genomic tools based on Hadoop: SEAL – distributed short-read alignment; BlastReduce – parallel read mapping; Crossbow – whole-genome re-sequencing analysis; CloudBurst – sensitive MapReduce alignment Copyright 2010 Cloudera Inc. All rights reserved
  • 39. Utilities and the Power Grid • The power grid is aging and maintained incrementally: failures are hard to predict and can have cascading effects; looking at vibration of transformers over time to find patterns • Predicting failure of grid equipment: supervised learning to scan time-series data for fuzzy patterns; identify likely faulting equipment for targeted replacement • Hadoop-based tools to model equipment behavior: openPDC project (http://openpdc.codeplex.com); Lumberyard – indexing time-series data for low-latency fuzzy queries Copyright 2010 Cloudera Inc. All rights reserved
  • 40. Smart Meter Example Workflow • Looking at usage patterns in home smart-meter data: how to educate consumers to save energy; capacity planning for the grid • Individual analysis is critical: personalized reporting to consumers; predictive modeling of peak usage and potential cost savings • Hadoop for collection, reporting and analysis: collect time-series samples in Hadoop; partition at various granularities and roll up reports and models Copyright 2010 Cloudera Inc. All rights reserved
  • 41. Biodiversity Indexing • Consolidation and serving of biological data: provide free and open access to biodiversity data; collection, search, discovery and access to a variety of data • Data matching and cleansing: geography, water/land mapping; dictionaries and taxonomic services • Data is harvested into multiple RDBMS: Sqoop to Hadoop for processing workflows and index generation; Sqoop back to MySQL for web app serving; future development is to crawl into and serve from HBase ©2011 Cloudera, Inc. All Rights Reserved.
  • 42. Preventing Network Failure • Need to model and understand network behavior: better understanding of how the network reacts to fluctuations; discrete anomalies may, in fact, be interconnected • Collection and forensic analysis of emerging patterns: record the data exhaust – all metrics, logs, traffic metadata; identify leading indicators of component failure • New techniques when all data is available: expand the range of indexing techniques, starting with simple scans and moving to more complex data mining ©2011 Cloudera, Inc. All Rights Reserved.
  • 43. Processing Seismic Data • Optimize the IO-intensive phases of seismic processing: incorporate additional parallelism where it makes sense; simplify gather/transpose operations with MapReduce • Seismic Unix for core algorithms: well known, used in many geophysics grad programs; the SU file format can be easily transformed for processing on HDFS • Hadoop Streaming: Seismic Unix, SEPlib, Javaseis – non-Java code in the MR framework; the framework is aware of parameter files needed by SU commands Copyright 2011 Cloudera Inc. All rights reserved
  • 44. Retail and Manufacturing: 1. Customer churn 2. Brand and sentiment analysis 3. Point of sale 4. Pricing models 5. Customer loyalty 6. Targeted offers ©2011 Cloudera, Inc. All Rights Reserved.
  • 45. Customer Churn Analysis • Understanding customer behavior and preferences: rapidly test and build behavioral models of customers; combine disparate data sources (transactional, social, etc.) • Structure and analyze with Hadoop: traversing usage and social graphs; pattern identification and recognition to find indicators • Feature extraction to find root causes: defining attributes and modeling statistical significance; combinations and sequences of attributes and actions factor in ©2011 Cloudera, Inc. All Rights Reserved.
  • 46. Brands and Sentiment Analysis • The Internet generates a lot of chatter about brands: understanding what’s being said is crucial to protecting brand value; Facebook and Twitter generate a lot of data for a global top brand • Capturing and processing direct feedback: better engagement and alerting via sentiment analysis; not yet ready for fully automated customer service • Hadoop handles the diverse data types and processing: sources of data are changing and semantics continuously evolving; the sophistication of algorithms is improving daily Copyright 2010 Cloudera Inc. All rights reserved
  • 47. Point of Sale Transaction Analysis • Lots of machine-generated data available: line items, stock, coupons, ads; stored in various formats • Pattern recognition enables constant reassessment: optimizing across multiple data sources • Demand prediction based on joining multiple data sets for more insight: retail supply chain; weather and financial data Copyright 2010 Cloudera Inc. All rights reserved
  • 48. Pricing Models • Retailers have increased flexibility in pricing: comparison shopping is dynamic; customers weigh combined value and time to delivery • Understand how prices affect purchasing: new techniques apply, such as A/B testing and spot discounts; motivations can be difficult to discern, need to look for correlations • Combinations multiply; Hadoop provides the scale to analyze: bundles can have incentive discounts; clustering and supervised learning to group attributes ©2011 Cloudera, Inc. All Rights Reserved.
  • 49. Customer Loyalty • Comparison shopping is making retail hyper-competitive: discount programs and e-mail correspondence entice shoppers; brand loyalty means attention to detail and service • The customer lifecycle is more than purchases: browsing and online data used to capture customer attention; loyalty programs bridge the gap between purchases • Reach into online channels: online engagement is personalized just as in store; connecting online and in-store shows customer awareness ©2011 Cloudera, Inc. All Rights Reserved.
  • 50. Targeted Offers • The checkout lane is everywhere: cookies track users through ad impressions; purchasing behavior is time sensitive • Logs collected from on-site and off-site browsing: data is ingested incrementally; processing happens at a variety of time scales • Data logged to HBase as the primary store: some events naturally associate, others require deeper analysis; random access is useful for debugging algorithms ©2011 Cloudera, Inc. All Rights Reserved.
  • 51. Web and e-Commerce: 1. Online media 2. Mobile 3. Online gaming 4. Search quality 5. Recommendations 6. Influence ©2011 Cloudera, Inc. All Rights Reserved.
  • 52. Online Media • Centralized platform for consolidated log processing: many online properties, each with separate system, ad and ops logs; different standards and techniques for processing • Data feeds are varied: advertising logs, website traffic feeds from 3rd-party providers, system logs, application logs and other operational metrics • The data pipeline can be normalized: cleansing, standard analytics and reporting; soon an exploratory platform as well as storage across all properties ©2011 Cloudera, Inc. All Rights Reserved.
  • 53. Mobile • Mobile advertisement platform: measuring impressions, clicks, actions and conversions; most metrics are arbitrary text strings (data is dirty) • Stringent SLAs for delivering results: an SLA of several minutes between event and report to advertisers; the SLA also covers data accuracy • Hadoop for ETL, analytics and reporting: HBase for serving results to advertisers; mimics the popular online analytics services ©2011 Cloudera, Inc. All Rights Reserved.
  • 54. Online Gaming • Consolidating data silos for a holistic view of users: various silos of data – user registration, financial, game play, web; popular games simulate real-world sports • First goal is accessibility: multiple businesses can access all data; game play metrics are extremely detailed (think sensor data) • Second is exploratory: distributions, event triggers, distinct counts and association rates; compute online statistics such as leaderboards ©2011 Cloudera, Inc. All Rights Reserved.
  • 55. Search Quality • Understand user search behavior: improve the service, assess quality of results; understand load, identify trends, generate predictive search • Search query logs stored in HDFS: Hive-based aggregation; Sqoop to RDBMS for end-user analytics • Now focused on internal monitoring: analytics have become a critical part of the service; where are analytic needs growing, and what data about searches do people want to see? ©2011 Cloudera, Inc. All Rights Reserved.
  • 56. Recommendations and Forecasting • Collect and serve personalization information: a wide variety of constantly changing data sources; data guaranteed to be messy • Data ingestion includes collection of raw data: filtering and fixing of poorly formatted data; normalization and matching across data sources • Analysis looks for reliable attributes and groupings: interpretation (e.g. gender by name); aggregation across likely matching identifiers; identify possible predicted attributes or preferences Copyright 2010 Cloudera Inc. All rights reserved
  • 57. Influence • Collect a fire hose of data about social commentary: personal opinions, references to opinions, links; look for tracking and referencing (like a very messy PageRank) • Hadoop to bucket and prepare for analysis: metadata and distinct topics; social graph scoring, bot and spam detection • The Hadoop stack is used throughout: Pig and Java, coordinated with Oozie; batch-serve data in CSV and load into HBase for API servers ©2011 Cloudera, Inc. All Rights Reserved.
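To make the "very messy PageRank" analogy concrete, here is a tiny power-iteration PageRank over a toy influence graph. The graph, iteration count and damping factor are illustrative assumptions (0.85 is the classic choice); the real pipeline above runs this kind of scoring at scale in Pig/Java.

```python
# Tiny power-iteration PageRank over a toy influence graph.
def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outlinks in graph.items():
            if outlinks:
                share = damping * rank[n] / len(outlinks)
                for m in outlinks:
                    new[m] += share
            else:  # dangling node: spread its rank evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# "alice" and "bob" both reference "brand"; bob also references alice.
graph = {"alice": ["brand"], "bob": ["brand", "alice"], "brand": []}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "brand" scores highest: most referenced
```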
  • 58. August 2012 – Cloudera University – Sarah Sproehnle
  • 59. Why invest in training? • Maximize your investment in a new technology • Make fewer mistakes by learning the best practices • Cheaper and easier to cross-train than hire – existing DBAs, analysts and system administrators can become Hadoop users
  • 60. Cloudera University • Experience – we’ve trained over 12,000 people; our courses incorporate the best practices that Cloudera has learned from supporting our customers • Depth of courseware – a comprehensive, role-based curriculum; we can train your entire staff in all aspects of CDH • Geographical coverage – we offer public and private classes in over 20 countries, including US, Canada, Brazil, Germany, UK, Poland, Spain, Israel, France, The Netherlands, South Africa, China, India, Australia and Singapore • Certification – available worldwide at Pearson VUE (vouchers included in our courses); certifications for Developers (CCDH), Admins (CCAH), and HBase (CCSHB) ©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
  • 61. Value proposition of private training • 12k/day for up to 20 students – NEW: 8k/day for up to 10 students – price includes courseware, lab materials, cert vouchers (for Dev, Admin, HBase), and T&E • We can tailor a class – we have ~3 weeks of content that we can mix and match into a customized class – saves the customer’s time by covering the most relevant topics, cutting out non-essential material • Customer chooses location and date • We’re under NDA
  • 62. Learning paths ©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.