2. Machine Learning may nourish the soul…
[Image source: Wikipedia (Banquet)]
…but Data Preparation will consume it.
[Image source: Wikipedia (Hell)]
3. Machine Learning on Large Datasets
Data Quality and Feature Engineering
• Figure out what's there
• Extract a bunch of features
• Figure out what's needed
• Finalize and feed
[Diagram: input data passes through Extract, Transform, Load ("Argghh!") to become feature data, which is split into training, validation, and test sets; a model is built, validated, and applied to new data to deliver value. Model building is supervised learning; feature extraction spans supervised and unsupervised learning.]
4. Problems with Processing Large Datasets
Not turn-key
Are data scientists really expected to know…
• how to set up Hadoop from scratch?
• Java, Pig, and the Hadoop APIs?
• how to extend with UDFs?
• how to extract, analyze, and visualize output beyond Hadoop?
“After hours of debugging our Hadoop setup, I was ecstatic to run a
Hadoop command without a java stack trace.”
- Zach
5. Problems with Processing Large Datasets
Not agile
                        Traditional Environment    Distributed Environment
Command response        < 1 sec                    > 30 sec
Dependency inclusion    Simple                     Several steps and changes
Validation              Established methods        Not clear
Development cycle       Fast iteration             Slow or linear
6. Apache Pig
• A dataflow processing system for MapReduce
• A high-level scripting language -- Pig Latin
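A minimal Pig Latin dataflow makes both points concrete; this is a sketch of the classic word count, with the file name and schema chosen for illustration:

```pig
-- Load raw text lines (file name and schema are illustrative)
lines = LOAD 'lines.txt' AS (line:chararray);
-- Split each line into a bag of word tuples, then flatten to one word per row
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'word_counts';
```

Each statement names a dataflow step; Pig compiles the whole script into a plan of MapReduce jobs.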
7. Why Pig for ETL?
• Easy to get up & running
• Easy to program – simple declarative scripting
language, built-in dataflow primitives
• Nested data model support
• First class extensibility – custom filters,
transforms, input/output formats, etc.
• Automatic dataflow optimization – Pig-to-MapReduce runtime
ratio: ~0.97x as of Pig 0.12
• As configurable as MR
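On configurability: Pig's SET statement passes properties straight through to the underlying jobs, so MapReduce-level tuning stays available from the script. A small sketch (the queue name is illustrative):

```pig
-- Default reducer parallelism for the whole script
SET default_parallel 20;
-- Arbitrary Hadoop job properties pass through to MapReduce
SET mapreduce.job.queuename 'etl';
```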
8. The story gets even better
• Elephant Bird – good support for different formats,
codecs, etc.
• DataFu – Pig UDFs for data mining & statistics
• PiggyBank – collection of additional UDFs
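As a sketch of how these libraries plug in (the jar file name and input data are illustrative; StreamingMedian is from DataFu's documented datafu.pig.stats package):

```pig
REGISTER datafu-1.2.0.jar;
-- Approximate median from DataFu; StreamingMedian does not require sorted input
DEFINE Median datafu.pig.stats.StreamingMedian();
scores = LOAD 'scores.tsv' AS (user:chararray, score:double);
grouped = GROUP scores BY user;
-- One approximate median score per user
medians = FOREACH grouped GENERATE group AS user, Median(scores.score);
```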
9. So, we’re done, right?
No. Many open challenges,
including complex models.
12. Graphical Machine Learning
• Need fully-integrated solutions that are easy to program
• Scale like Hadoop; speed and accuracy of in-memory graph analytics and mining
• Enables applications in broadband services, network security, retail, life sciences,
financial markets, etc.
[Diagram: input data from databases, the web, and documents flows into Intel Graph Builder on Hadoop/HDFS to construct a graph; graph query, processing, and storage support building and serving models that deliver insight and prediction]
13. Graph Processing: Technology Challenges
Progress!
• Performance – has skyrocketed with in-memory and asynchronous graph engines and scalable graph query architectures
• Algorithms – a wide range of toolkits with graph mining and graphical machine learning algorithms, with more sophisticated and scaled versions arriving "every day"
Traction
• Data Models – most large-scale work is still on homogeneous graphs, but property graphs and meta-path concepts are more widely discussed
Not so much
• Programming – challenging programming models in languages not popular with data scientists, IT developers, and other end users
• Data Visualization – no great packages to visualize the relationships du jour; interactive big-data sampling and projection are too crude and slow
• Data Preparation – takes way too long, is way too manual, and is fraught with error
• Integration – multiple frameworks are difficult to synchronize, coordinate, and manage
Intel Labs continues to work on the gaps.
14. Pig ETL for Graphs?
Nothing specific for graph ETL yet. What's needed:
• support for well-known graph input/output formats
• graph-specific filters & transforms
• STORE functions for graph stores
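A hypothetical sketch of what such graph-aware Pig ETL could look like; EdgeListStorage and all field names here are invented for illustration (no such built-in exists today):

```pig
-- Turn raw click logs into a weighted edge list (all names illustrative)
clicks = LOAD 'clicks.tsv' AS (user:chararray, item:chararray);
pairs = GROUP clicks BY (user, item);
edges = FOREACH pairs GENERATE FLATTEN(group) AS (src, dst), COUNT(clicks) AS weight;
-- A graph-specific STORE function like this is the missing piece
STORE edges INTO 'graph' USING EdgeListStorage();
```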
15. Graph Builder 2 Alpha
• Construction of heterogeneous information networks with Pig
• Better "progressive refinement" during acquisition, cleaning, and integration
• Incremental graph construction
• Interfacing for popular graph databases (Titan, RDF output, etc.)
[Diagram: a social graph ("friends" edges among Frank, Ted, Mohit, Ivy, Kushal, Nezih, and Danny, with Nezih and Danny as brothers) combined with a ratings graph ("likes" edges) and a product graph (bicycles; a food cart that "uses" bicycles), supporting inferences such as "Ted may like a bicycle-powered food cart"]
* Inspired by, “Titan: Rise of Big Graph Data,” by M. Rodriguez and M. Broecheler
16. Example Stack Architecture
[Diagram: raw data from RDBMS, HDFS, and NFS passes through graph ETL (Pig + Graph Builder) into feature and model stores; graph analytics and ML run on Giraph and Mahout; real-time graph queries run on Titan over HBase/HDFS via Blueprints, Gremlin, and Rexster; ZooKeeper and Hadoop coordinate the stack]
18. Development Pains with Pig As-Is
Data Process Flow:
• Load with Pig
• Turn into edge list (Pig, UDF)
• Store to HDFS (Pig)
• Load into Titan (GraphBuilder)
• Run ML algorithms (Giraph)
• Model queries (Gremlin)
Development Flow (or, what actually happened):
• Extract with Python
• Develop transforms
• Test on a couple of files
• Fix bugs
• Run Python in Jython (fail miserably)
• Spend too much time enabling
• Write UDF in Java
• Find limitations
• Develop custom load UDF instead
• …
All of this before any Machine Learning!
19. Out-of-the-Box Tools
Custom UDFs add a lot of complexity, time and effort.
If you don’t have this….
X = FOREACH A GENERATE
TOKENIZE(f1);
(More of these please)
You’re stuck with this…
package org.apache.pig.builtin;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray, but got "
                        + o.getClass().getName());
            }
            StringTokenizer tok = new StringTokenizer((String) o, " \",()*", false);
            while (tok.hasMoreTokens()) {
                output.add(mTupleFactory.newTuple(tok.nextToken()));
            }
            return output;
        } catch (ExecException ee) {
            // error handling goes here
        }
        return null;
    }

    public Schema outputSchema(Schema input) { /* … */ }
}
20. Breadth of Knowledge
[Diagram: each pipeline step — load raw data, extract links, filter bad data, group like links together, store to HBase, store into Titan (Graph Builder) — requires some mix of Pig, Java, and MapReduce expertise]
21. Even if you have ninja skills, you’ll
still need to deal with weirdness.
29. Complex JSON/XML processing is painful
[Pig architecture diagram, area 1: user interface (interactive, batch, and embedded modes), scripting interface, built-in functions and operators, parser, planner, LOAD/STORE functions, UDFs, and MR-job backends; LOAD functions highlighted]
{ "Top-Level-Field": "top_level", "Inner-Json": [{ "Name": "inner-name", "Value": 10 }]}

json_data = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
unnested = FOREACH json_data GENERATE
    $0#'Top-Level-Field' AS (top_level_field_value: chararray),
    FLATTEN($0#'Inner-Json') AS (inner_json: map[]);
unnested = FOREACH unnested GENERATE
    top_level_field_value,
    FLATTEN(inner_json#'Name') AS (inner_name: chararray),
    FLATTEN(inner_json#'Value') AS (inner_value: long);
30. [Pig architecture diagram, area 2: the embedded-mode interface (Java, Python, etc.) highlighted]
Better high-level language integration
Native-like experience with non-JVM languages (Python, R, etc.)
REST interface can be improved (HCATALOG-182)
31. [Pig architecture diagram, area 3: the interactive mode and execution engines highlighted]
Better data exploration & error reporting
Faster iterative processing (Spark, YARN)
Better SAMPLE (WIP: PIG-1713)
SUMMARY for descriptive statistics
More descriptive error messages
32. [Pig architecture diagram, area 4: STORE functions highlighted]
Better control with HBaseStorage
• Inefficient for bulk loading
• Better HBase filter support
• Batching support
• Fetch multiple versions
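For reference, a typical HBaseStorage round trip today looks roughly like this (table and column names are illustrative); the wish list above is about giving this interface more control:

```pig
-- Load two columns from an HBase table, keeping the row key as the first field
users = LOAD 'hbase://users'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:name info:age', '-loadKey true')
    AS (id:bytearray, name:chararray, age:long);
-- Store back; the first field is the row key, the rest map to the listed columns
STORE users INTO 'hbase://users_copy'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:age');
```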
35. Abstract
Intel is working hard to build datacenter software from the silicon up that
provides for a wide range of advanced analytics on Apache Hadoop. The
Graph Analytics Operation within Intel Labs is helping to transform Hadoop
into a full-blown “knowledge discovery platform” that can deftly process a
wide range of data models, from simple tables to multi-property graphs,
using sophisticated machine learning algorithms and data mining
techniques. But, the analysis cannot start until features are engineered, a
task that takes a lot of time and effort today. In this talk, I will describe
some of the Hadoop-based tools we are developing to make it easier for
data scientists to deal with data quality issues and construct features for
scalable machine learning, including graph-based approaches.