Data Science with Hadoop: A Primer

Apache Hadoop is quickly becoming the technology of choice for organizations investing in big data, powering their next-generation data architecture. With Hadoop serving as both a scalable data platform and computational engine, data science is re-emerging as a centerpiece of enterprise innovation, with applied data solutions such as online product recommendation, automated fraud detection and customer sentiment analysis. In this talk, Ofer will provide an overview of data science and how to take advantage of Hadoop for large-scale data science projects:

  • What is data science?
  • How can techniques like classification, regression, clustering and outlier detection help your organization?
  • What questions do you ask and which problems do you go after?
  • How do you instrument and prepare your organization for applied data science with Hadoop?
  • Who do you hire to solve these problems?

You will learn how to plan, design and implement a data science project with Hadoop.



Usage Rights

© All Rights Reserved

  • Data science is not new. But now we need to do it with much larger datasets.
  • As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It does not replace the database; it complements it, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with:
    • Existing applications, such as Tableau, SAS, Business Objects, etc.
    • Existing databases and data warehouses, for loading data to and from the data warehouse
    • Development tools used for building custom applications
    • Operational tools for managing and monitoring

Data Science with Hadoop: A Primer (Presentation Transcript)

  • © Hortonworks Inc. 2013 Hortonworks Data Science with Hadoop – A Primer Hadoop Summit, June 2013 Ofer Mendelevitch ofer@hortonworks.com @ofermend
  • Who am I?
    currently <- c(role = "director of data sciences", company = "Hortonworks")
    • Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc.
    • Blog: www.achessdad.com
  • What will I be talking about?
    • What is Data Science?
    • Hadoop and Data Science
    • Use-cases: data science with Hadoop
    • How to get started?
  • What is Data Science? The art of building data products.
    • Data product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data.
    • What is a data scientist? A person who does this.
  • Data science & big data
  • With Hadoop, the time and cost of building large-scale data products are dramatically reduced
  • An Apache Hadoop Platform: Hortonworks Data Platform (HDP)
    • Hadoop Core (Distributed Storage & Processing): HDFS, MapReduce
    • Platform Services (Enterprise Readiness): HA, DR, Snapshots, Security, …
    • Data Services (Store, Process and Access Data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
    • Operational Services (Manage & Operate at Scale): Oozie, Ambari
    • Deployment options: OS / VM, Cloud, Appliance
  • A typical Big Data Architecture
    • Applications: Business Analytics, Custom Applications, Packaged Applications
    • Data systems: Traditional Repos (RDBMS, EDW, MPP), Hortonworks Data Platform
    • Data sources: Traditional Sources (RDBMS, OLTP, OLAP), New Sources (web logs, email, sensor data, social media); also labeled on the diagram: Mobile Data; OLTP, POS Systems
    • Dev & Data Tools: Build & Test
    • Operational Tools: Manage & Monitor
  • Keys to Hadoop’s power
    • Computation co-located with data
      – Data and computation system co-designed to work together
    • Affordable at scale
      – Use “commodity” hardware nodes
      – Self-healing; failure handled by software
      – Very good at batch processing of large datasets
  • Hadoop improves productivity of data scientists
    • All data in one place
      – Ability to store all the data in raw format
      – Data silo convergence
      – Data scientists will find innovative uses of combined data assets
    • Data/compute capabilities available as a shared asset
      – Data scientists can quickly prototype a new idea without an up-front request for funding
  • Data-driven innovation is accelerated since Hadoop is “schema on read”
    • First timeline (marks: Start, 6 months, 9 months): “I need new data” → “Schema change” project → “Finally, we start collecting” → “Let me see… is it any good?”
    • Second timeline (mark: 3 months): “Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!”
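The "schema on read" idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a local list of JSON strings stands in for files in an HDFS folder, and the record fields (`user`, `page`, `ms`, `ref`) and the helper `read_with_schema` are hypothetical names invented for the example, not part of the talk.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no upfront schema).
RAW_LINES = [
    '{"user": "U101", "page": "/home", "ms": 120}',
    '{"user": "U102", "page": "/cart"}',                         # older record, no "ms" field
    '{"user": "U103", "page": "/home", "ms": 95, "ref": "ad"}',  # newer record, extra field
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> default value) at read time.

    Fields missing from a record get the default; fields not named in the
    schema are ignored. The raw lines themselves are never modified.
    """
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field, default) for field, default in schema.items()}


# Today's question only needs two fields; tomorrow's can use a different schema
# over the same raw data, with no "schema change" project in between.
rows = list(read_with_schema(RAW_LINES, {"user": None, "ms": 0}))
```

The design point is that the decision about which fields matter is deferred until someone asks a question of the data, which is what lets new data be collected immediately.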
  • Hadoop is ideal for pre-processing of large raw datasets
    • Raw Data → Processed Data, through steps such as:
      – Strip away HTML/PDF/DOC/PPT
      – Entity resolution
      – Document vector generation
      – Term normalization
      – Sampling, filtering
      – Joins
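Three of the pre-processing steps named on this slide (stripping markup, term normalization, document vector generation) can be sketched on a single machine as follows. This is a minimal illustration, not the pipeline used in the talk: the regular expressions are a crude stand-in for real HTML parsing, the function names are hypothetical, and at scale each function would run per record inside a distributed job.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")      # crude markup matcher, for illustration only
TOKEN_RE = re.compile(r"[a-z0-9]+")  # a "term" is a run of lowercase letters or digits


def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return TAG_RE.sub(" ", text)


def normalize_terms(text):
    """Lowercase the text and split it into simple alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


def document_vector(text):
    """Build a sparse term-frequency vector (term -> count) for one document."""
    return Counter(normalize_terms(strip_html(text)))


vec = document_vector("<p>Pump failed. <b>PUMP</b> replaced.</p>")
```

Here `vec` counts "pump" twice because normalization folds "Pump" and "PUMP" into the same term, which is the point of the term normalization step.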
  • In machine learning, very often: more data -> better outcomes (Banko & Brill, 2001)
    • More examples to learn from
    • More possible feature types
      – We’re looking for the most useful for our task
  • Use-cases
  • A (partial) map of data science “tasks”
    • Discovery
      – Clustering: detect natural groupings
      – Outlier detection: detect anomalies
      – Affinity analysis: co-occurrence patterns
    • Prediction
      – Classification: predict a category
      – Regression: predict a value
      – Recommendation: predict a preference
    • Big Data Science: high energy physics, genomics, etc.
  • Use-case: product recommendation
    • Inputs:
      – Explicit product ratings (when provided)
      – Implicit information: purchase transactions, page views, comments
    • Data sources shown: Ratings, Page views, Forum Comments
    • Example ratings matrix (“?” marks an unknown rating)
      – Products: Epic, X-Men, Hobbit, Argo, Pirates
      – Users: U101, U102, U103, U104, U105, …
      – Values as transcribed: 5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5
  • Goal: predict a preference
    • Products: Epic, X-Men, Hobbit, Argo, Pirates
    • Users: U101, U102, U103, U104, U105, …
    • Ratings before (“?” marks an unknown rating): 5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5
    • Ratings after (unknowns filled in with predictions): 5 2 4 1 3 4 1 5 2 3 1 2 4 1 3 3 2 3 1 5
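One common way to fill in an unknown rating is user-based collaborative filtering: weight other users' ratings of the item by how similar their tastes are to the target user's. The sketch below is an assumption about how this could be done, not the method used in the talk. The ratings dictionary is made-up illustrative data (it is not a reconstruction of the slide's matrix), and the function names are hypothetical.

```python
from math import sqrt

# Made-up sparse ratings: user -> {product: rating}. Missing keys are unknown ratings.
ratings = {
    "U101": {"Epic": 5, "X-Men": 2, "Hobbit": 4},
    "U102": {"Hobbit": 5, "Argo": 2},
    "U103": {"Epic": 1, "X-Men": 2, "Pirates": 3},
    "U104": {"X-Men": 2, "Hobbit": 3, "Argo": 1, "Pirates": 5},
}


def similarity(a, b):
    """Cosine similarity over the products both users have rated (0.0 if none)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item.

    Returns None when no other user with a nonzero similarity has rated it.
    """
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = similarity(ratings[user], theirs)
        num += weight * theirs[item]
        den += weight
    return num / den if den else None


argo_for_u101 = predict(ratings, "U101", "Argo")
```

This similarity measure is deliberately simple (with a single product in common it is always 1.0); production recommenders typically use mean-centered similarities or matrix factorization, and on large preference datasets the pairwise computations are what get distributed across the cluster.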
  • Using Hadoop for recommendation
    • Data sources: Transactions, Page views, Content
    • Diagram stages: HDFS, Pre-process (MapReduce, SQL), Recommend (Custom Logic), Online serving
    • With Hadoop, we can process very large preference datasets
  • Use-case: failure prediction
    • Inputs:
      – Equipment history: install date, model, past issues
      – Equipment sensor data
      – Product catalog: product families, expected lifetime
    • Data sources shown: history, Sensor data, Product Catalog
    • Example data:
      SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
      113454  5/1/2011      1345               94002     72        180
      998323  5/3/2009      3234               88321     68        450
      345375  8/2/2005      1112               53323     82        332
      …       …             …                  …
  • Building a prediction model
    • Labeled data (used to build the model):
      SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
      113454  5/1/2011      1345               94002     72        180
      998323  5/3/2009      3234               88321     68        450
      345375  8/2/2005      1112               53323     82        332
      …       …             …                  …
    • Unseen data (the model predicts TTF):
      SKU     Install date  Service Person ID  Zip code  Avg temp
      332456  3/3/2013      1345               94005     71
      442343  6/6/2013      1112               77485     67
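The labeled-data, model, unseen-data flow on this slide can be sketched with the simplest possible model: a one-feature least-squares line that predicts TTF from average temperature. This is an illustration only. It uses the three labeled rows from the slide, which is far too little data for a meaningful fit, and the choice of a linear model on a single feature is an assumption made for the example, not something stated in the talk.

```python
# (avg temp, TTF in days) pairs taken from the labeled rows on the slide.
labeled = [(72, 180), (68, 450), (82, 332)]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x, on one feature."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_ttf(model, avg_temp):
    """Apply the fitted line to a row the model has not seen."""
    intercept, slope = model
    return intercept + slope * avg_temp


model = fit_line(labeled)

# Avg temp values of the two unseen rows on the slide (SKUs 332456 and 442343).
predictions = {sku: predict_ttf(model, temp) for sku, temp in [(332456, 71), (442343, 67)]}
```

A real failure-prediction model would use many more rows and features (install date, zip code, service history) and would be evaluated on held-out labeled data before being trusted on unseen equipment.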
  • Using Hadoop for failure prediction
    • HDFS: central repository for all data
      – Service records (Word, PDF, etc.)
      – Equipment purchase transaction data
      – Product catalog: SKUs, model numbers, etc.
    • Pre-process
      – Convert service records to item features: remove PDF formatting, detect entities in records
      – Normalize data using service records, product catalog
      – Create feature matrix; ready for modeling algorithm
  • Use-case: SaaS application security
    • Inputs:
      – Click-stream: user interaction with application
    • Data sources shown: User data, Clicks
    • Example data:
      User ID  User since  Logins/month  Avg DL KB/day  …
      123456   1/3/2004    6             30
      998323   5/3/2009    1             5
      345375   8/2/2005    22            120
      …        …           …             …
  • Detecting anomalous behavior records
    • User access profile modeled as vector of features
    • Detect anomalies in application access patterns
      – Rules based
      – Machine learning based (determine “outlier factor”: 0…1)
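To make the "vector of features" and "outlier factor" ideas concrete, here is a minimal sketch. The scoring formula is an assumption chosen for illustration (the talk does not define how its outlier factor is computed): each feature is standardized across users, the distance of a user's profile from the average profile is measured, and that distance is squashed into the range 0 to 1. The profiles reuse the logins/month and average download values from the SaaS security example table.

```python
from math import sqrt
from statistics import mean, pstdev

# User ID -> access profile [logins per month, average download KB per day].
profiles = {
    "123456": [6, 30],
    "998323": [1, 5],
    "345375": [22, 120],
}


def outlier_factors(profiles):
    """Score each profile in [0, 1): higher means further from the average user.

    Assumed formula: Euclidean norm of the per-feature z-scores, d, mapped
    through d / (1 + d) so the result is bounded.
    """
    columns = list(zip(*profiles.values()))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 when a feature has zero spread, to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    factors = {}
    for user, vec in profiles.items():
        dist = sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(vec, means, stds)))
        factors[user] = dist / (1.0 + dist)
    return factors


factors = outlier_factors(profiles)
most_anomalous = max(factors, key=factors.get)
```

A rules-based detector would instead flag profiles by fixed thresholds (for example, a cap on downloads per day); the score-based approach ranks users so that investigation effort can go to the highest factors first.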
  • Using Hadoop for anomaly detection
    • HDFS: central repository for all raw data
      – Raw user-access logs
      – User information (organization, demographics)
    • Pre-process
      – Build access-profile (behavioral) for each user
    • Detect anomalies
      – In Hadoop
      – Using existing tools: R, SAS, rules engine, etc.
  • How do I get started?
  • Getting started with data science on Hadoop
    1. Pick a good use-case that delivers immediate business value
    2. Implement a proof-of-value (POV)
    3. Build a team (hire/train)
  • Implement a proof-of-value
    • Put together a Hadoop cluster
    • Define the POV business use-case
    • Pull the raw data you need into the cluster
    • Build it
    • Show the business value of your data assets
    Contact us. We can help!
  • Build a team: the data scientist skillset continuum
    • Roles shown on the continuum: Software Engineer, Data Engineer, Data Scientist, Applied Scientist, Research Scientist
    • Data Engineer
      – Function: builds production-grade data products
      – Good at: data and systems architecture; Hadoop, Pig/Hive, MapReduce, Mahout; Java, Python, Perl, SQL, C++, etc.; NoSQL (HBase, Cassandra, Mongo)
    • Applied Scientist
      – Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
      – Good at: statistics, machine learning; text processing, NLP; R, Matlab, SAS, SQL; scripting, prototyping; visualization / telling the story
  • Thank you! Any questions?
    Ofer Mendelevitch, Director, Data Sciences @ Hortonworks
    ofer@hortonworks.com, @ofermend
    We’re hiring!
    Data Science training: www.hortonworks.com/training