
Ted Willke, Intel Labs MLconf 2013



Ted Willke, Principal Engineer/GM, Intel Labs: "Avoiding Cluster-Scale Headaches with Better Tools for Data Quality and Feature Engineering"

Published in: Technology, Education


  1. 1. Graph Analytics Operation Intel Labs
  2. 2. Machine Learning may nourish the soul… (image source: Wikipedia, "Banquet") …but Data Preparation will consume it. (image source: Wikipedia, "Hell")
  3. 3. Machine Learning on Large Datasets: Data Quality and Feature Engineering. Data flow: New Data → Input Data → Feature Data (Extract, Transform, Load: "Argghh!") → Training Set, Validation Set, Test Set → Build Model → Validate → Value. ETL steps: • Figure out what's there • Extract a bunch of features • Figure out what's needed • Finalize and feed. Diagram labels: Supervised Learning; Supervised and Unsupervised Learning.
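The training, validation, and test sets in the pipeline above can be illustrated with a minimal Python sketch. The function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions for illustration; the slides do not say how the sets were produced.

```python
import random


def split_dataset(records, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test sets.

    The fractions are illustrative defaults, not values from the talk.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for repeatability
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return training, validation, test


training, validation, test = split_dataset(range(100))
```

Every record lands in exactly one of the three sets, so the model is never validated or tested on data it was trained on.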
  4. 4. Problems with Processing Large Datasets: Not turn-key. Are data scientists really expected to know… • how to set up Hadoop from scratch? • Java, Pig, and the Hadoop APIs? • how to extend with UDFs? • how to extract, analyze, and visualize output beyond Hadoop? “After hours of debugging our Hadoop setup, I was ecstatic to run a Hadoop command without a Java stack trace.” - Zach
  5. 5. Problems with Processing Large Datasets: Not agile. Traditional environment vs. distributed environment: • Command response: < 1 sec vs. > 30 sec • Dependency inclusion: simple vs. several steps and changes • Validation: established methods vs. not clear • Development cycle: fast iteration vs. slow or linear
  6. 6. Apache Pig • A dataflow processing system for MapReduce • A high-level scripting language -- Pig Latin
  7. 7. Why Pig for ETL? • Easy to get up & running • Easy to program – simple declarative scripting language, built-in dataflow primitives • Nested data model support • First-class extensibility – custom filters, transforms, input/output formats, etc. • Automatic dataflow optimization – Pig/MR runtime: ~0.97x for 0.12 • As configurable as MR
  8. 8. The story gets even better • Elephant Bird – good support for different formats, codecs, etc. • DataFu – Pig UDFs for data mining & statistics • PiggyBank – collection of additional UDFs
  9. 9. So, we’re done, right? No. Many open challenges, including complex models.
  10. 10. Property Graph Data Models Source: Tinkerpop (Property Graph)
  11. 11. Graph Applications. Machine Learning: • Neural Networks • Deep Learning (RBM) • Belief Propagation • Label Propagation/ARW • Collaborative Filtering (ALS, SGD, SVD) • Topic Modeling (LDA) • K-Means. Mining: • PageRank • Random Walk with Restart • Connected Components • Triangle Counting • K-Truss • Centrality Measures • Network Diameter • Degree Distribution. Traversal (Search): • Depth-First Search • Breadth-First Search
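As one concrete instance of the mining algorithms listed above, here is a minimal single-machine sketch of PageRank by power iteration. It is a textbook formulation, not the scaled implementation discussed in the talk; the damping factor 0.85, the iteration count, and the toy edge list are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank scores for a directed edge list by power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without out-links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): a three-node cycle plus one extra page linking into it.
ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
```

Node `a` ends up with the highest score because it is the only node with two in-links, while `d`, which nothing links to, keeps only the teleportation share.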
  12. 12. Graphical Machine Learning • Need fully-integrated solutions that are easy to program • Scale like Hadoop; speed and accuracy of in-memory graph analytics and mining • Enables applications in broadband services, network security, retail, life sciences, financial markets, etc. Diagram: Input Data (HDFS, DB, Web, Docs) → Construct Graph (Intel Graph Builder) → Build Model → Serve Model (Graph Query Processing & Storage) → Insight & Prediction
  13. 13. Graph Processing: Technology Challenges • Performance – Has skyrocketed with in-memory and asynchronous graph engines and scalable graph query architectures • Algorithms – A wide range of toolkits with graph mining and graphical machine learning algorithms, with more sophistication and scaled versions arriving “every day” [Traction] • Data Models – Most large-scale work still on homogeneous graphs but property graphs and meta-path concepts are more widely discussed • Programming – Challenging programming models in languages not popular with data scientists, IT developers, and other end-users [Not so much progress!] • Data Visualization – No great packages to visualize relationships du jour, and interactive big data sampling and projection too crude & slow • Data Preparation – Takes way too long, is way too manual, and is fraught with error • Integration – Multiple frameworks are difficult to synchronize, coordinate, and manage. Intel Labs continues to work on the gaps.
  14. 14. Pig ETL for Graphs? Nothing specific for graph ETL. What’s needed: • support for well-known input/output graph formats • graph-specific filters & transforms • STORE functions for graph stores. Original Vision
  15. 15. Graph Builder 2 Alpha • Construction of heterogeneous information networks with Pig • Better “progressive refinement” during acquisition, cleaning, and integration • Incremental graph construction • Interfacing for popular graph databases (Titan, RDF output, etc.) Figure*: a Product Graph, a Ratings Graph, and a Social Graph combined; people (Ted, Frank, Mohit, Ivy, Kushal, Nezih, Danny) are linked by "friends" and "brothers" edges, with "likes" and "uses" edges to products (Bicycles, Food Cart). Example inference: Ted may like a bicycle-powered food cart. * Inspired by “Titan: Rise of Big Graph Data,” by M. Rodriguez and M. Broecheler
  16. 16. Example Stack Architecture. Raw Data: RDBMS, HDFS, NFS. Graph ETL: Pig, Graph Builder. Graph Analytics: Giraph. ML: Mahout. Real-time Graph Queries: Gremlin, Rexster, Blueprints, Titan. Platform: Hadoop, HDFS, HBase, ZooKeeper. Stores: Feature Store, Model Store
  17. 17. Graph ETL Example (followed by PageRank and Latent Dirichlet Allocation)
      Extract (Archive Records): Path=159 20091120145711 text/html 28628 HTTP/1.1 200 OK <table border="0" width="100%" cellspacing="0" cellpadding="0"> <tr> <td width="100%" class="infoBoxHeading_search">Quick Find</td> </tr></table><table border="0" width="100%" cellspacing="0" cellpadding="0" class="infoBox_search"> <tr> <td><table border="0" width="100%" cellspacing="0" cellpadding="3" . . .
      Transform (parse HTML, look for links and words): { "url": " Path=159", "timestamp": "20091120145711", "links": [" t.php", "", ...], "words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "html", "php"] }
      Load (Graph Builder to Titan), columns as shown on the slide: row src dst #links
      67033:-20071306431384422339653 2 91658:-20071306431384422339653 2 941:-19442631361384422339653 1 44116:-18273037921384422339653 3 36891:-18273037921384422339653 3 79906:-17817899301384422339654 1 2238:-17817799001384422339654 1 68133:-17817799001384422339654 1 30677:-17817799001384422339654 1 81185:-17817799001384422339654 1 47527:-17817799001384422339654 1 63112:-17817799001384422339654 1 74837:-17817799001384422339654 6 53668:-17817799001384422339654 4 97945:-17817799001384422339654 12 93849:-17709983361384422339654 1 51421:-17700453681384422339654 1 13022:-17651665521384422339654 2 16530:-17113867601384422339654 2 14199:-16755866041384422339654 1 95253:-16755866041384422339654 1 25828:-14077538951384422339655 1 88133:-14077538951384422339655 2 94243:-14077538951384422339655 1 56826:-14077538951384422339655 1 88574:-14077538951384422339655 1 81966:-14077538951384422339655 145 83164:-14077538951384422339655 1 99087:-14077538951384422339655 1 39124:-14077538951384422339655 3 95995:-14077538951384422339655 2
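The Transform step above (parse HTML, look for links and words) might look roughly like the following Python sketch built on the standard-library `html.parser`. The class name `LinkAndWordExtractor`, the helper `extract_record`, the letters-only tokenization rule, and the `example.com` URLs are assumptions for illustration; the talk's actual extraction code is not shown in the slides.

```python
import re
from html.parser import HTMLParser


class LinkAndWordExtractor(HTMLParser):
    """Collect href targets of <a> tags and lowercase words from text nodes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Assumed tokenization: runs of ASCII letters, lowercased.
        self.words.update(re.findall(r"[a-z]+", data.lower()))


def extract_record(url, timestamp, html):
    """Build a record with the same fields as the JSON shown on the slide."""
    parser = LinkAndWordExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "timestamp": timestamp,
        "links": parser.links,
        "words": sorted(parser.words),
    }


record = extract_record(
    "http://example.com/",  # placeholder URL, not from the source data
    "20091120145711",
    '<td class="infoBoxHeading_search">Quick Find</td>'
    '<a href="http://example.com/golf.php">Golf</a>',
)
```

Because each text node is tokenized separately, adjacent elements do not run together, which matters for the data-quality questions raised later in the deck.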
  18. 18. Development Pains with Pig As-Is. Data process flow: Load with Pig → Turn into edge list (Pig, UDF) → Store to HDFS (Pig) → Load into Titan (GraphBuilder) → Run ML algorithms (Giraph) → Model queries (Gremlin). Development flow (or, what actually happened): Extract with Python → Develop transforms → Test on a couple of files → Fix bugs → Run Python in Jython (fail miserably) → Spend too much time enabling → Write UDF in Java → Find limitations → Develop custom load UDF instead → … All of this before any machine learning!
  19. 19. Out-of-the-Box Tools Custom UDFs add a lot of complexity, time and effort. If you don’t have this…. X = FOREACH A GENERATE TOKENIZE(f1); (More of these please) You’re stuck with this… package org.apache.pig.builtin; import import import import import import import import import; java.util.StringTokenizer; org.apache.pig.EvalFunc;;;;; org.apache.pig.impl.logicalLayer.schema.Schema;; public class TOKENIZE extends EvalFunc<DataBag> { TupleFactory mTupleFactory = TupleFactory.g BagFactory mBagFactory = BagFactory.getInst public DataBag exec(Tuple input) throws IOE try { DataBag output = mB Object o = if (!(o instanceof throw n } StringTokenizer tok while (tok.hasMoreT return output; } catch (ExecException ee) { // error handling g } } public Schema outputSchema(Schema input) {
  20. 20. Breadth of Knowledge. Technologies: Java, Pig, MapReduce, HBase, Graph Builder. Steps: Load Raw Data → Extract Links → Filter Bad Data → Group Like Links Together → Store (HBase) → Store into Titan (Graph Builder)
  21. 21. Even if you have ninja skills, you’ll still need to deal with weirdness.
  22. 22. Random Record { "url": "", "timestamp": "20091120145711", "words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "gti", "cache", "returnsprivacy", "partsfirm", "rear", "scratch", "passat", "gmt", "plate", "handle", "scuff", "was", "hatch", "wagonnew", "got", "nov", "password", "advanced", "and", "aluminum", "controlbackup", "server", "specials", "quick", "x", "largo", "www", "splash", "items", "edition", "beetle", "home", "even", "cargo", "what", "mats", "for", "sportwagen", "please", "anniversary", "content", "lights", "fog", "new", "email", "monster", "ipod", "beetlepassatrabbitroutantiguantouareg", "quite", "cart", "acoustic", "apache", "november", "includes", "by", "care", "search", "ok", "of", "openssl", "shipping", "covereuropean", "s", "products", "stainless", "wheels", "protector", "com", "vinyl", "duty", "pre", "encoding", "cc", "sweet", "noticeconditions", "clear", "unix", "electronic", "post", "cabrio", "powered", "chrome", "transfer", "fri", "accent", "mkv", "cpath", "jetta", "maintenance", "type", "store", "more", "function", "shopping", "gear", "hub", "tiguan", "park", "pragma", "brushed", "must", "steel", "distance", "account", "look", "default", "bbs", "cap", "us", "guards", "reviews", "routan", "fun", "chunked", "my", "usecontact", "control", "they", "bumper", "warning", "brake", "eurovan", "exterior", "close", "rig", "dent", "check", "registerforgot", "information", "revalidate", "sensors", "connection", "member", "html", "touareg", "online", "interior", "european", "wheel", "http", "though", "models", "bestsellers", "expires", "categories", "kit", "catalog", "date", "php", "e", "center", "light", "thu", "no", "cover", "frontpage", "eos", "gorilla", "contact", "the", "selectcabriocceoseurovangligolfgtijettajetta", "left"] }
  23. 23. Random Record (same record as on slide 22): Uselessly common words, such as "and", "the", "of", and "for". Common connector words can be trimmed …with a bunch more ETL.
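Trimming common connector words is a small filter step. The sketch below assumes a hand-picked stop-word set drawn from words visible in the sample record, plus a minimum token length; neither the list nor the threshold comes from the talk.

```python
# Assumed stop-word list; a real pipeline would likely use a larger, curated one.
STOP_WORDS = frozenset(
    {"and", "the", "of", "for", "by", "was", "what", "they", "my", "us", "no"}
)


def remove_stop_words(words, stop_words=STOP_WORDS, min_length=2):
    """Drop connector words and very short tokens, keeping the original order."""
    return [w for w in words if w not in stop_words and len(w) >= min_length]


sample = ["golf", "rabbit", "and", "x", "the", "aluminum", "of", "s", "jetta"]
filtered = remove_stop_words(sample)
```

In a Pig pipeline the same logic would typically live in a filter UDF, which is exactly the extra ETL the slide is complaining about.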
  24. 24. Random Record (same record as on slide 22): Words mangled together? For example "platestainless", "returnsprivacy", and "beetlepassatrabbitroutantiguantouareg". Is there an edge case that’s causing this?
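One plausible edge case, offered here as an assumption rather than the diagnosis reached in the talk, is stripping HTML tags without leaving whitespace behind, so that text from adjacent elements fuses into a single token. A minimal Python sketch of the effect, using a made-up HTML fragment:

```python
import re

TAG = re.compile(r"<[^>]+>")

# Hypothetical fragment: adjacent list items with no whitespace between them.
html = "<li>Beetle</li><li>Passat</li><li>Rabbit</li>"


def words_naive(markup):
    """Delete tags outright: neighbouring text nodes run together."""
    return re.findall(r"[a-z]+", TAG.sub("", markup).lower())


def words_spaced(markup):
    """Replace each tag with a space so element boundaries split tokens."""
    return re.findall(r"[a-z]+", TAG.sub(" ", markup).lower())


mangled = words_naive(html)
separated = words_spaced(html)
```

The naive version yields one fused token, while the spaced version recovers three separate words.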
  25. 25. Random Record (same record as on slide 22): Were these actually visible? “html” was found in every record; something seems wrong.
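A quick way to catch tokens like “html” that show up in every record is to compute document frequency across records and flag anything at or above a threshold. The sketch below is an assumed sanity check, not a tool from the talk; the function names and the toy records are hypothetical.

```python
from collections import Counter


def document_frequency(records):
    """Count, for each token, how many records contain it at least once."""
    counts = Counter()
    for words in records:
        counts.update(set(words))  # set() so repeats within a record count once
    return counts


def suspicious_tokens(records, threshold=1.0):
    """Return tokens present in at least `threshold` fraction of all records."""
    counts = document_frequency(records)
    return sorted(
        token for token, count in counts.items()
        if count / len(records) >= threshold
    )


# Toy records (hypothetical) mimicking the symptom described on the slide.
records = [
    ["html", "golf", "jetta"],
    ["html", "food", "cart"],
    ["html", "golf", "bicycle"],
]
flagged = suspicious_tokens(records)
```

A token flagged this way is a hint that something other than visible page text, such as header or markup content, is leaking into the word list.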
  26. 26. Pig ETL script (slide annotations shown as comments):
      -- Load raw data
      raw_data = LOAD '/zach/common-crawl/1285409360731_9.arc.gz' USING ArcLoader() AS (header:chararray, html:chararray);
      -- Extract links
      edge_list = FOREACH raw_data GENERATE ExtractLinks(*);
      -- Filter & Normalize
      edge_list_filtered = FILTER edge_list BY FilterAny(*);
      src_based = FOREACH edge_list_filtered GENERATE NormalizeURL(*, 0);
      src_based_cleaned = FILTER src_based BY FilterMalformedURL(*, 1);
      dest_based = FOREACH src_based_cleaned GENERATE NormalizeURL(*, 1);
      dest_based_self_loops_removed = FILTER dest_based BY FilterLoop(*);
      final = FILTER dest_based_self_loops_removed BY NOT (src_domain MATCHES '.*mailto.*' OR dest_domain MATCHES '.*mailto.*');
      -- Generate Link Counts
      grouped = GROUP final BY (src_domain,dest_domain) PARALLEL 64;
      with_link_count = FOREACH grouped GENERATE group.src_domain, group.dest_domain, COUNT(final) AS num_links:long;
      -- Assign HBase Keys
      with_hbase_keys = FOREACH with_link_count GENERATE RowKeyAssignerUDF(*);
      final_graph = FOREACH with_hbase_keys GENERATE FLATTEN($0) AS (key:chararray, src_domain:chararray, dest_domain:chararray, num_links:long);
      -- Store into Titan
      STORE_GRAPH(final_graph, 'hbase://pagerank_edge_list', 'Titan');
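For readers less familiar with Pig, the filter, normalize, and count stages of the script above can be approximated on a single machine in Python. This is only a sketch of the logic: the helper names, the rule that non-HTTP schemes are dropped, and the treatment of same-domain links as self-loops are assumptions, since the behaviour of the `NormalizeURL`, `FilterMalformedURL`, and `FilterLoop` UDFs is not shown in the slides.

```python
from collections import Counter
from urllib.parse import urlparse


def domain(url):
    """Return the lowercase host of an http(s) URL, or None if unusable."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return None  # drops mailto: links and malformed URLs
    return parsed.hostname


def link_counts(edges):
    """Count links per (src_domain, dest_domain) pair, skipping self-loops."""
    counts = Counter()
    for src_url, dest_url in edges:
        src, dest = domain(src_url), domain(dest_url)
        if src is None or dest is None:
            continue
        if src == dest:  # assumed meaning of the self-loop filter
            continue
        counts[(src, dest)] += 1
    return counts


# Hypothetical edge list using reserved example domains.
edges = [
    ("http://a.example/page1", "http://b.example/x"),
    ("http://a.example/page2", "http://b.example/y"),
    ("http://a.example/page1", "http://a.example/about"),
    ("http://a.example/page1", "mailto:someone@b.example"),
    ("http://b.example/x", "http://c.example/"),
]
counts = link_counts(edges)
```

The resulting counter plays the role of `with_link_count` in the Pig script: one weighted edge per pair of domains, ready to be keyed and loaded into a graph store.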
  27. 27. Demo.
  28. 28. Open Problems with Pig ETL (for Data Science)
  29. 29. Open problem 1: Complex JSON/XML processing is painful. (Pig architecture diagram: User Interface with Interactive Mode, Batch Mode, and Embedded Mode (Java, Python, etc.); Pig Scripting Interface; LOAD Functions; Built-in Functions and Operators; Parser; Data Type Support; Planner; STORE Functions; Backend & Execution Engines; Open source packages; UDFs; MR Jobs.)
      Input: { "Top-Level-Field": "top_level", "Inner-Json": [{ "Name": "inner-name", "Value": 10 }]}
      json_data = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
      unnested = FOREACH json_data GENERATE $0#'Top-Level-Field' AS (top_level_field_value: chararray), FLATTEN($0#'Inner-Json') AS (inner_json: map[]);
      unnested = FOREACH unnested GENERATE top_level_field_value, FLATTEN(inner_json#'Name') AS (inner_name: chararray), FLATTEN(inner_json#'Value') AS (inner_value:long);
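For comparison, the same unnesting of the sample JSON document takes only a few lines in a general-purpose language. The Python sketch below uses the field names from the slide; the function name `flatten` and the tuple layout are assumptions chosen to mirror the final Pig relation.

```python
import json


def flatten(record):
    """Emit one (top_level, inner_name, inner_value) row per inner object."""
    rows = []
    for inner in record.get("Inner-Json", []):
        rows.append((record["Top-Level-Field"], inner["Name"], inner["Value"]))
    return rows


raw = (
    '{"Top-Level-Field": "top_level", '
    '"Inner-Json": [{"Name": "inner-name", "Value": 10}]}'
)
rows = flatten(json.loads(raw))
```

The contrast with the two-pass `FOREACH ... FLATTEN` version is the point of the slide: nested data that is trivial to walk in Python needs careful map dereferencing and schema annotations in Pig.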
  30. 30. Open problem 2: Better high-level language integration • Native-like experience with non-JVM languages (Python, R, etc.) • REST interface can be improved (HCATALOG-182). (Same Pig architecture diagram as on slide 29.)
  31. 31. Open problem 3: Better data exploration & error reporting • Faster iterative processing (Spark, YARN) • Better SAMPLE (WIP: PIG-1713) • SUMMARY for descriptive statistics • More descriptive error messages. (Same Pig architecture diagram as on slide 29.)
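To make the SAMPLE and SUMMARY wishes above concrete, here is a Python sketch of fixed-size sampling from a stream (reservoir sampling) and a small descriptive-statistics helper. Both are illustrations of the kind of capability being asked for, not descriptions of how Pig implements or plans to implement them; the function names and the sample values are hypothetical.

```python
import random
import statistics


def reservoir_sample(items, k, seed=0):
    """Draw k items uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace with probability k / (i + 1)
    return sample


def summary(values):
    """Return basic descriptive statistics for a numeric column."""
    values = list(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# Hypothetical link counts with one extreme outlier.
num_links = [2, 2, 1, 3, 3, 1, 1, 6, 4, 12, 145]
stats = summary(num_links)
picked = reservoir_sample(range(1000), 10)
```

A summary like this makes outliers obvious at a glance: the median is 3 while the maximum is 145, which is the sort of signal a data scientist wants before committing to a long cluster job.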
  32. 32. Open problem 4: Better control with HBaseStorage • Inefficient for bulk loading • Better HBase filter support • Batching support • Fetch multiple versions. (Same Pig architecture diagram as on slide 29.)
  33. 33. Questions? • Graph Builder 2 Alpha Dec’13 • Apache 2.0 OS code available at:
  34. 34. Legal Notices • INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. • Intel may make changes to specifications and product descriptions at any time, without notice. • All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. • Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. • Code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. • Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. • Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries. • *Other names and brands may be claimed as the property of others. • Copyright © 2013 Intel Corporation.
  35. 35. Abstract: Intel is working hard to build datacenter software from the silicon up that provides for a wide range of advanced analytics on Apache Hadoop. The Graph Analytics Operation within Intel Labs is helping to transform Hadoop into a full-blown “knowledge discovery platform” that can deftly process a wide range of data models, from simple tables to multi-property graphs, using sophisticated machine learning algorithms and data mining techniques. But the analysis cannot start until features are engineered, a task that takes a lot of time and effort today. In this talk, I will describe some of the Hadoop-based tools we are developing to make it easier for data scientists to deal with data quality issues and construct features for scalable machine learning, including graph-based approaches.