Graph Analytics Operation

Intel Labs
Machine Learning may
nourish the soul…
Source: Wikipedia (Banquet)

... but Data Preparation
will consume it.

Source: Wikipedia (Hell)
Machine Learning on Large Datasets
Data Quality and Feature Engineering
[Figure: machine learning pipeline. Labels: New Data, Input Data, Extract / Transform / Load ("Argghh!"), Feature Data, Training Set, Validation Set, Test Set, Build Model, Validate, Value. Annotations: Supervised Learning; Supervised and Unsupervised Learning.]

• Figure out what’s there
• Extract a bunch of features
• Figure out what’s needed
• Finalize and feed
Problems with Processing Large Datasets
Not turn-key
Are data scientists really expected to know…
• how to set up Hadoop from scratch?
• Java, Pig, Hadoop APIs?
• how to extend with UDFs?
• how to extract, analyze and visualize output beyond Hadoop?

“After hours of debugging our Hadoop setup, I was ecstatic to run a
Hadoop command without a java stack trace.”
- Zach
Problems with Processing Large Datasets
Not agile
                       Traditional Environment   Distributed Environment
Command response       < 1 sec                   > 30 sec
Dependency inclusion   Simple                    Several steps and changes
Validation             Established methods       Not clear
Development Cycle      Fast Iteration            Slow or linear
Apache Pig
• A dataflow processing system for MapReduce
• A high-level scripting language -- Pig Latin
Why Pig for ETL?
• Easy to get up & running
• Easy to program – simple declarative scripting
language, built-in dataflow primitives
• Nested data model support
• First class extensibility – custom filters,
transforms, input/output formats, etc.
• Automatic dataflow optimization – Pig/MR runtime:
~0.97x for 0.12
• As configurable as MR
The story gets even better
• Elephant Bird – good support for different formats,
codecs, etc.
• DataFu – Pig UDFs for data mining & statistics
• PiggyBank – collection of additional UDFs
So, we’re done, right?
No. Many open challenges,
including complex models.
Property Graph Data Models

Source: Tinkerpop (Property Graph)
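
A property graph attaches key/value properties to both vertices and labeled, directed edges. The following is a minimal Python sketch of that data model, not the Tinkerpop or Titan API; the class names, the `weight` property and the sample "likes" edge are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Tiny in-memory property graph: vertices and edges both carry properties."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vid, **props):
        self.vertices[vid] = Vertex(vid, props)

    def add_edge(self, src, dst, label, **props):
        # Both endpoints must exist before an edge can connect them.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("unknown endpoint")
        self.edges.append(Edge(src, dst, label, props))

    def out_edges(self, vid, label=None):
        # Outgoing edges of a vertex, optionally restricted to one label.
        return [e for e in self.edges
                if e.src == vid and (label is None or e.label == label)]


# Hypothetical sample data, for illustration only.
g = PropertyGraph()
g.add_vertex("ted", kind="person")
g.add_vertex("bicycles", kind="product")
g.add_edge("ted", "bicycles", "likes", weight=1.0)
liked = [e.dst for e in g.out_edges("ted", "likes")]
```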
Graph Applications
Machine Learning
• Neural Networks
• Deep Learning (RBM)
• Belief Propagation
• Label Propagation/ARW
• Collaborative Filtering (ALS, SGD, SVD)
• Topic Modeling (LDA)
• K-Means

Mining
• PageRank
• Random Walk with Restart
• Connected Components
• Triangle Counting
• K-Truss
• Centrality Measures
• Network Diameter
• Degree Distribution

Traversal (Search)
• Depth-First Search
• Breadth-First Search
Graphical Machine Learning
• Need fully-integrated solutions that are easy to program
• Scale like Hadoop; speed and accuracy of in-memory graph analytics and mining
• Enables applications in broadband services, network security, retail, life sciences,
financial markets, etc.


[Figure: workflow. Stages: Input Data, Construct Graph, Build Model, Serve Model, Insight & Prediction. Other labels: HDFS, DB, Web, Docs, Intel Graph Builder, Graph Query Processing & Storage.]
Graph Processing: Technology Challenges
Performance – Has skyrocketed with in-memory and asynchronous
graph engines and scalable graph query architectures
Algorithms – A wide range of toolkits with graph mining and graphical
machine learning algorithms, with more sophistication and scaled
versions arriving “every day”

Traction

Data Models – Most large-scale work still on homogeneous graphs but
property graphs and meta-path concepts are more widely discussed
Programming – Challenging programming models in languages not
popular with data scientists, IT developers, and other end-users
Not so much

Progress!

Data Visualization – No great packages to visualize relationships du
jour and interactive big data sampling and projection too crude & slow
Data Preparation – Takes way too long, is way too manual, and is
fraught with error

Integration – Multiple frameworks are difficult to synchronize,
coordinate, and manage

Intel Labs continues to work on the gaps.
Pig ETL for Graphs?
Nothing specific for graph ETL. What’s needed:
• support for well-known input-output graph formats
• graph specific filters & transforms
• STORE functions for graph stores
Original
Vision
Graph Builder 2 Alpha
• Construction of heterogeneous information networks with Pig
• Better “progressive refinement” during acquisition, cleaning, and integration
• Incremental graph construction
• Interfacing for popular graph databases (Titan, RDF output, etc.)
[Figure: three linked graphs. Social Graph: Frank, Ted, Mohit, Ivy, Kushal, Nezih and Danny, connected by "friends" edges and one "brothers" edge. Ratings Graph: "likes" edges from people to products. Product Graph: Bicycles and Food Cart, with a "uses" edge. Inference shown: Ted may like a bicycle-powered food cart.]

* Inspired by, “Titan: Rise of Big Graph Data,” by M. Rodriguez and M. Broecheler
Example Stack Architecture

[Figure: stack diagram. Layer labels: Raw Data, Graph ETL, Graph Analytics, ML, Real-time Graph Queries. Components: RDBMS, HDFS, NFS, Pig, Graph Builder, Giraph, Mahout, Hadoop, ZooKeeper, Titan, HBase, Blueprints, Gremlin, Rexster, Feature Store, Model Store.]
Graph ETL Example

Extract

Archive Records
http://www.1stvwparts.com/default.php?cPath=159 74.86.123.84 20091120145711 text/html 28628
HTTP/1.1 200 OK
<table border="0" width="100%" cellspacing="0" cellpadding="0"> <tr>
<td width="100%" class="infoBoxHeading_search">Quick Find</td> </tr></table>
<table border="0" width="100%" cellspacing="0" cellpadding="0" class="infoBox_search">
<tr>
<td><table border="0" width="100%" cellspacing="0" cellpadding="3"
. . .

Transform

Parse HTML, look for links and words
{
"url": "http://www.1stvwparts.com/default.php?cPath=159",
"timestamp": "20091120145711",
"links": ["http://www.1stvwparts.com/shopping_cart.php", "http://www.partsfirm.com", ...],
"words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "html", "php"]
}

PageRank and Latent Dirichlet Allocation

Load

Graph Builder to Titan
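
The Transform step ("parse HTML, look for links and words") can be sketched in Python with the standard-library HTML parser. This is an illustration of the idea, not the UDF used in the talk; the `LinkWordParser` class, the `transform` helper and the sample anchor tag are assumptions.

```python
import re
from html.parser import HTMLParser


class LinkWordParser(HTMLParser):
    """Collects href targets of <a> tags and lower-cased words from text nodes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Only visible text reaches this callback, not tag names or attributes.
        self.words.update(w.lower() for w in re.findall(r"[A-Za-z]+", data))


def transform(url, timestamp, html):
    parser = LinkWordParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "timestamp": timestamp,
            "links": parser.links, "words": sorted(parser.words)}


# Hypothetical fragment modeled on the archive record above.
record = transform(
    "http://www.1stvwparts.com/default.php?cPath=159",
    "20091120145711",
    '<td class="infoBoxHeading_search">Quick Find</td>'
    '<a href="http://www.partsfirm.com">parts</a>',
)
```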
row  src  dst  #links
67033:-20071306431384422339653  http://www.kog.com  http://www.dlstainedglass.com  2
91658:-20071306431384422339653  http://www.kog.com  http://www.haegerstainedglass.com  2
941:-19442631361384422339653  http://www.ks-p.jp  http://www.drag-race.nuhuh.bee.pl  1
44116:-18273037921384422339653  http://www.kune.fr  http://www.chezfanny.fr  3
36891:-18273037921384422339653  http://www.kune.fr  http://www.wp-jobboard.kune.fr  3
79906:-17817899301384422339654  http://www.kwc.edu  http://www.umsl-sports.com  1
2238:-17817799001384422339654  http://www.kwc.org  http://www.onlamp.com  1
68133:-17817799001384422339654  http://www.kwc.org  http://www.tjhsst.edu  1
30677:-17817799001384422339654  http://www.kwc.org  http://www.floydlandis.com  1
81185:-17817799001384422339654  http://www.kwc.org  http://www.you-are-here.com  1
47527:-17817799001384422339654  http://www.kwc.org  http://www.phonak-cycling.ch  1
63112:-17817799001384422339654  http://www.kwc.org  http://www.link.brightcove.com  1
74837:-17817799001384422339654  http://www.kwc.org  http://www.trustbut.blogspot.com  6
53668:-17817799001384422339654  http://www.kwc.org  http://www.icanhascheezburger.com  4
97945:-17817799001384422339654  http://www.kwc.org  http://www.mythbustersfanclub.com  12
93849:-17709983361384422339654  http://www.kwmd.us  http://www.sierraclub.typepad.com  1
51421:-17700453681384422339654  http://www.kwne.jp  http://www.ppvj.co.jp  1
13022:-17651665521384422339654  http://www.kwu.edu  http://www.rollinghillszoo.com  2
16530:-17113867601384422339654  http://www.kyou.nu  http://www.fan.unfading-scar.net  2
14199:-16755866041384422339654  http://www.kzy.com  http://www.wbbm780.com  1
95253:-16755866041384422339654  http://www.kzy.com  http://www.brewview.com  1
25828:-14077538951384422339655  http://www.lee.org  http://www.kaiju.com  1
88133:-14077538951384422339655  http://www.lee.org  http://www.sfgov.org  2
94243:-14077538951384422339655  http://www.lee.org  http://www.liftport.com  1
56826:-14077538951384422339655  http://www.lee.org  http://www.nishioka.com  1
88574:-14077538951384422339655  http://www.lee.org  http://www.Smartflix.com  1
81966:-14077538951384422339655  http://www.lee.org  http://www.smartflix.com  145
83164:-14077538951384422339655  http://www.lee.org  http://www.torrentspy.com  1
99087:-14077538951384422339655  http://www.lee.org  http://www.SerpentMother.com  1
39124:-14077538951384422339655  http://www.lee.org  http://www.serpentmother.com  3
95995:-14077538951384422339655  http://www.lee.org  http://www.toolbar.google.com  2
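
An edge list of (src, dst, #links) rows like the one above is the input PageRank needs. The talk runs its algorithms on Giraph; the sketch below is only a small single-machine power iteration in Python to show the computation. Weighting out-links by `#links`, the damping factor of 0.85 and the fixed iteration count are assumptions, not settings taken from the slides.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """edges: list of (src, dst, num_links) tuples. Returns {node: rank}."""
    out_weight = defaultdict(float)
    nodes = set()
    for src, dst, w in edges:
        out_weight[src] += w
        nodes.update((src, dst))
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src, dst, w in edges:
            # Each source splits its rank across targets in proportion to link count.
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


# Three rows taken from the edge list above.
edges = [
    ("http://www.kog.com", "http://www.dlstainedglass.com", 2),
    ("http://www.kog.com", "http://www.haegerstainedglass.com", 2),
    ("http://www.kune.fr", "http://www.chezfanny.fr", 3),
]
ranks = pagerank(edges)
```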
Development Pains with Pig As-Is
Data Process Flow

Load with Pig
Turn into edge list (Pig, UDF)
Store to HDFS (Pig)
Load into Titan (GraphBuilder)
Run ML algorithms (Giraph)
Model queries (Gremlin)

Development Flow
(or, what actually happened)

Extract with python
Develop transforms
Test on a couple files
Fix bugs
Run python in Jython (fail miserably)
Spend too much time enabling
Write UDF in Java
Find limitations
Develop custom load UDF instead
…

All of this before any Machine Learning!
Out-of-the-Box Tools
Custom UDFs add a lot of complexity, time and effort.
If you don’t have this….
X = FOREACH A GENERATE
TOKENIZE(f1);

(More of these please)

You’re stuck with this…
package org.apache.pig.builtin;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

public class TOKENIZE extends EvalFunc<DataBag> {
TupleFactory mTupleFactory = TupleFactory.g
BagFactory mBagFactory = BagFactory.getInst
public DataBag exec(Tuple input) throws IOE
try {
DataBag output = mB
Object o = input.ge
if (!(o instanceof
throw n
}
StringTokenizer tok
while (tok.hasMoreT
return output;
} catch (ExecException ee) {
// error handling g
}
}
public Schema outputSchema(Schema input) {
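
For comparison with the Java UDF above, the core of a tokenizer fits in a few lines of Python. This sketch mirrors the `StringTokenizer` loop visible in the truncated listing; the delimiter set (space, double quote, comma, parentheses, asterisk) is an assumption about the built-in's defaults, and the function returns a plain list rather than a Pig bag of tuples.

```python
def tokenize(text, delimiters=' ",()*'):
    """Split text on any delimiter character, dropping empty tokens."""
    tokens = []
    current = []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the token being built, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


tokens = tokenize("golf, rabbit (gti)")
```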
Breadth of Knowledge
Java

Pig

Load Raw Data
Extract Links
Filter Bad Data
Group Like
Links Together
Store - HBase

Store into Titan (Graph Builder)

MapReduce
Even if you have ninja skills, you’ll
still need to deal with weirdness.
Random Record
{
"url": "http://www.1stvwparts.com/default.php?cPath=159",
"timestamp": "20091120145711",
"words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "gti",
"cache", "returnsprivacy", "partsfirm", "rear", "scratch", "passat", "gmt", "plate", "handle", "scuff", "was",
"hatch", "wagonnew", "got", "nov", "password", "advanced", "and", "aluminum", "controlbackup", "server", "specials",
"quick", "x", "largo", "www", "splash", "items", "edition", "beetle", "home", "even", "cargo", "what", "mats", "for",
"sportwagen", "please", "anniversary", "content", "lights", "fog", "new", "email", "monster", "ipod",
"beetlepassatrabbitroutantiguantouareg", "quite", "cart", "acoustic", "apache", "november", "includes", "by", "care",
"search", "ok", "of", "openssl", "shipping", "covereuropean", "s", "products", "stainless", "wheels", "protector",
"com", "vinyl", "duty", "pre", "encoding", "cc", "sweet", "noticeconditions", "clear", "unix", "electronic", "post",
"cabrio", "powered", "chrome", "transfer", "fri", "accent", "mkv", "cpath", "jetta", "maintenance", "type", "store",
"more", "function", "shopping", "gear", "hub", "tiguan", "park", "pragma", "brushed", "must", "steel", "distance",
"account", "look", "default", "bbs", "cap", "us", "guards", "reviews", "routan", "fun", "chunked", "my", "usecontact",
"control", "they", "bumper", "warning", "brake", "eurovan", "exterior", "close", "rig", "dent", "check",
"registerforgot", "information", "revalidate", "sensors", "connection", "member", "html", "touareg", "online",
"interior", "european", "wheel", "http", "though", "models", "bestsellers", "expires", "categories", "kit", "catalog",
"date", "php", "e", "center", "light", "thu", "no", "cover", "frontpage", "eos", "gorilla", "contact", "the",
"selectcabriocceoseurovangligolfgtijettajetta", "left"]
}
Random Record
Uselessly common words

Common connector words can be trimmed
…with a bunch more ETL.
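
Trimming common connector words is a simple filter once a stop-word list exists. A minimal sketch, assuming a hand-picked stop list drawn from words in the record above and a minimum token length of 2; a real pipeline would likely use a larger curated list.

```python
# Hypothetical stop list; every entry appears in the sample record.
STOP_WORDS = frozenset({"and", "by", "for", "of", "the", "was",
                        "what", "they", "my", "no", "us"})


def trim_common_words(words, stop_words=STOP_WORDS, min_length=2):
    """Drop stop words and tokens shorter than min_length, keeping order."""
    return [w for w in words if w not in stop_words and len(w) >= min_length]


sample = ["golf", "and", "rabbit", "the", "x", "of", "gti", "s"]
trimmed = trim_common_words(sample)
```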
Random Record
Words mangled together?

Is there an edge case that’s causing this?
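
One plausible edge case, offered as a hypothesis rather than the confirmed cause: if tags are stripped without leaving whitespace behind, text from adjacent elements (for example the options of a dropdown) fuses into a single token. The sketch below reproduces the effect on a made-up snippet and shows the spaced variant that avoids it.

```python
import re

TAG = re.compile(r"<[^>]+>")


def strip_tags_naive(html):
    # Removing tags outright glues neighboring text nodes together.
    return TAG.sub("", html)


def strip_tags_spaced(html):
    # Replacing each tag with a space keeps neighboring text nodes apart.
    return TAG.sub(" ", html)


def words(text):
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical markup; not taken from the crawled page.
snippet = "<option>Cabrio</option><option>CC</option><option>Eos</option>"
fused = words(strip_tags_naive(snippet))
separate = words(strip_tags_spaced(snippet))
```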
Random Record
Were these actually visible?

“html” was found in every record, something seems wrong.
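
A quick way to catch tokens like this is to compute document frequency and flag anything that appears in (nearly) every record. A minimal sketch, assuming records shaped like the JSON above with a `words` list; the three sample records and the thresholds are made up for illustration.

```python
from collections import Counter


def document_frequency(records):
    """Count, for each word, the number of records that contain it."""
    df = Counter()
    for record in records:
        df.update(set(record["words"]))
    return df


def suspicious_tokens(records, threshold=1.0):
    """Words present in at least `threshold` (a fraction) of all records."""
    n = len(records)
    if n == 0:
        return []
    df = document_frequency(records)
    return sorted(w for w, c in df.items() if c / n >= threshold)


# Hypothetical records for illustration.
records = [
    {"words": ["golf", "html", "http"]},
    {"words": ["kaiju", "html", "http"]},
    {"words": ["zoo", "html"]},
]
in_every_record = suspicious_tokens(records)
in_most_records = suspicious_tokens(records, threshold=0.6)
```

Tokens flagged this way (here "html") are candidates for leakage from HTTP headers or markup rather than visible page text.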
-- Load raw data
raw_data = LOAD '/zach/common-crawl/1285409360731_9.arc.gz' USING ArcLoader()
    AS (header:chararray, html:chararray);

-- Extract links
edge_list = FOREACH raw_data GENERATE ExtractLinks(*);

-- Filter & Normalize
edge_list_filtered = FILTER edge_list BY FilterAny(*);
src_based = FOREACH edge_list_filtered GENERATE NormalizeURL(*, 0);
src_based_cleaned = FILTER src_based BY FilterMalformedURL(*, 1);
dest_based = FOREACH src_based_cleaned GENERATE NormalizeURL(*, 1);
dest_based_self_loops_removed = FILTER dest_based BY FilterLoop(*);
final = FILTER dest_based_self_loops_removed
    BY NOT (src_domain MATCHES '.*mailto.*' OR dest_domain MATCHES '.*mailto.*');

-- Generate Link Counts
grouped = GROUP final BY (src_domain,dest_domain) PARALLEL 64;
with_link_count = FOREACH grouped GENERATE group.src_domain,
    group.dest_domain,
    COUNT(final) AS num_links:long;

-- Assign HBase Keys
with_hbase_keys = FOREACH with_link_count GENERATE RowKeyAssignerUDF(*);
final_graph = FOREACH with_hbase_keys GENERATE FLATTEN($0)
    AS (key:chararray, src_domain:chararray, dest_domain:chararray, num_links:long);

-- Store into Titan
STORE_GRAPH(final_graph, 'hbase://pagerank_edge_list', 'Titan');
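
The filter, normalize and count stages of the script above are easy to prototype on a handful of rows before running at scale. The Python sketch below is an approximation, not a port of the talk's UDFs: `domain` and `link_counts` are hypothetical helpers, and reducing a URL to its lower-cased scheme and host is an assumption about what normalization should do.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain(url):
    """Return 'scheme://host' in lower case, or None for non-HTTP or malformed URLs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    return f"{parts.scheme}://{parts.hostname}"


def link_counts(raw_edges):
    """Count links per (src_domain, dest_domain), skipping bad rows and self loops."""
    counts = Counter()
    for src_url, dst_url in raw_edges:
        try:
            src, dst = domain(src_url), domain(dst_url)
        except ValueError:  # urlsplit rejects some malformed inputs
            continue
        if src is None or dst is None:  # malformed or non-HTTP, e.g. mailto
            continue
        if src == dst:  # self loop
            continue
        counts[(src, dst)] += 1
    return counts


# Hypothetical page-level links for illustration.
raw = [
    ("http://www.lee.org/a", "http://www.Smartflix.com/x"),
    ("http://www.lee.org/b", "http://www.smartflix.com/y"),
    ("http://www.lee.org/a", "http://www.lee.org/b"),
    ("http://www.lee.org/a", "mailto:someone@example.com"),
]
counts = link_counts(raw)
```

Because `urlsplit` lower-cases the host, mixed-case duplicates such as `Smartflix.com` and `smartflix.com` collapse into one edge in this sketch.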
Demo.
Open Problems with Pig ETL
(for Data Science)
[Figure: Pig architecture. Labels: User Interface (Interactive Mode, Batch Mode, Embedded Mode for Java, Python, etc.), Pig Scripting Interface, LOAD Functions, STORE Functions, Built in Functions and Operators, Data Type Support, Parser, Planner, Backend & Execution Engines, MR Jobs, UDFs, Open source packages.]

1. Complex JSON/XML processing is painful
{ "Top-Level-Field": "top_level", "Inner-Json": [{ "Name": "inner-name", "Value": 10 }]}
json_data = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
unnested = FOREACH json_data GENERATE $0#'Top-Level-Field' AS (top_level_field_value: chararray),
FLATTEN($0#'Inner-Json') AS (inner_json: map[]);
unnested = FOREACH unnested GENERATE top_level_field_value,
FLATTEN(inner_json#'Name') AS (inner_name: chararray),
FLATTEN(inner_json#'Value') AS (inner_value:long);
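
For contrast, the same unnesting takes only a few lines with Python's standard `json` module. A minimal sketch on the sample document above; `unnest` is a hypothetical helper that emits one row per element of `Inner-Json`.

```python
import json


def unnest(line):
    """Return one (top_level_field, inner_name, inner_value) row per inner object."""
    doc = json.loads(line)
    top = doc["Top-Level-Field"]
    return [(top, inner["Name"], inner["Value"])
            for inner in doc.get("Inner-Json", [])]


line = ('{ "Top-Level-Field": "top_level", '
        '"Inner-Json": [{ "Name": "inner-name", "Value": 10 }]}')
rows = unnest(line)
```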

2. Better high-level language integration
Native-like experience with non-JVM languages (Python, R, etc.)
REST interface can be improved (HCATALOG-182)


3. Better data exploration & error reporting
Faster iterative processing (Spark, YARN)
Better SAMPLE (WIP: PIG-1713)
SUMMARY for descriptive statistics
More descriptive error messages
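
To make the SAMPLE and SUMMARY wishes concrete, here is a Python sketch of the two operations on a stream of rows. Reservoir sampling (Algorithm R) is one common way to draw a fixed-size uniform sample in a single pass; it is an illustration, not the design proposed in PIG-1713, and `summary` is a hypothetical stand-in for a descriptive-statistics operator.

```python
import random
import statistics


def reservoir_sample(rows, k, seed=None):
    """Uniform sample of k rows from an iterable of unknown length, in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


def summary(values):
    """Basic descriptive statistics for a non-empty list of numbers."""
    values = sorted(values)
    return {"count": len(values), "min": values[0], "max": values[-1],
            "mean": statistics.mean(values), "median": statistics.median(values)}


sample = reservoir_sample(range(100), 10, seed=0)
stats = summary([2, 2, 1, 3, 3])  # link counts from a few edge-list rows
```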


4. Better control with HBaseStorage
Inefficient for bulk loading
Better HBase filter support
Batching support
Fetch multiple versions

Questions?
• Graph Builder 2 Alpha Dec’13
• Apache 2.0 OS code available at:
www.01.org/graphbuilder/
Legal Notices
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
• Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2013 Intel Corporation.
Abstract
Intel is working hard to build datacenter software from the silicon up that
provides for a wide range of advanced analytics on Apache Hadoop. The
Graph Analytics Operation within Intel Labs is helping to transform Hadoop
into a full-blown “knowledge discovery platform” that can deftly process a
wide range of data models, from simple tables to multi-property graphs,
using sophisticated machine learning algorithms and data mining
techniques. But, the analysis cannot start until features are engineered, a
task that takes a lot of time and effort today. In this talk, I will describe
some of the Hadoop-based tools we are developing to make it easier for
data scientists to deal with data quality issues and construct features for
scalable machine learning, including graph-based approaches.

More Related Content

What's hot

High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Turi, Inc.
 
H2O & Tensorflow - Fabrizio
H2O & Tensorflow - Fabrizio H2O & Tensorflow - Fabrizio
H2O & Tensorflow - Fabrizio Sri Ambati
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleTuri, Inc.
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneSri Ambati
 
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2ODeep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2OSri Ambati
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engineLars Marius Garshol
 
Machine Learning with Azure
Machine Learning with AzureMachine Learning with Azure
Machine Learning with AzureBarbara Fusinska
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14Sri Ambati
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learningjoshwills
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2Oodsc
 
Distributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowDistributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowJan Wiegelmann
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...Andy Petrella
 

What's hot (20)

High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
H2O & Tensorflow - Fabrizio
H2O & Tensorflow - Fabrizio H2O & Tensorflow - Fabrizio
H2O & Tensorflow - Fabrizio
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at Scale
 
H2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to EveryoneH2O Deep Water - Making Deep Learning Accessible to Everyone
H2O Deep Water - Making Deep Learning Accessible to Everyone
 
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2ODeep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
Machine Learning with Azure
Machine Learning with AzureMachine Learning with Azure
Machine Learning with Azure
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2O
 
Distributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowDistributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlow
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
H20: A platform for big math
H20: A platform for big math H20: A platform for big math
H20: A platform for big math
 

Ted Willke, Intel Labs MLconf 2013

  • 2. Machine Learning may nourish the soul… Source: Wikipedia (Banquet) ... but Data Preparation will consume it. Source: Wikipedia (Hell)
  • 3. Machine Learning on Large Datasets: Data Quality and Feature Engineering. Diagram labels: New Data; Input Data; Feature Data; Extract, Transform, Load; Training Set; Build Model; Validate; Validation Set; Test Set; Value; Supervised Learning; Supervised and Unsupervised Learning. Argghh! • Figure out what’s there • Extract a bunch of features • Figure out what’s needed • Finalize and feed
  • 4. Problems with Processing Large Datasets: Not turn-key. Are data scientists really expected to know… • how to set up Hadoop from scratch? • Java, Pig, Hadoop APIs? • how to extend with UDFs? • how to extract, analyze and visualize output beyond Hadoop? “After hours of debugging our Hadoop setup, I was ecstatic to run a Hadoop command without a java stack trace.” - Zach
  • 5. Problems with Processing Large Datasets: Not agile
      Aspect | Traditional Environment | Distributed Environment
      Command response | < 1 sec | > 30 sec
      Dependency inclusion | Simple | Several steps and changes
      Validation | Established methods | Not clear
      Development Cycle | Fast Iteration | Slow or linear
  • 6. Apache Pig • A dataflow processing system for MapReduce • A high-level scripting language -- Pig Latin
  • 7. Why Pig for ETL? • Easy to get up & running • Easy to program – simple declarative scripting language, built-in dataflow primitives • Nested data model support • First class extensibility – custom filters, transforms, input/output formats, etc. • Automatic dataflow optimization – Pig/MR runtime: ~0.97x for 0.12 • As configurable as MR
  • 8. The story gets even better • Elephant Bird – good support for different formats, codecs, etc. • DataFu – Pig UDFs for data mining & statistics • PiggyBank – collection of additional UDFs
  • 9. So, we’re done, right? No. Many open challenges, including complex models.
  • 10. Property Graph Data Models Source: Tinkerpop (Property Graph)
  • 11. Graph Applications
      Machine Learning: Neural Networks; Deep Learning (RBM); Belief Propagation; Label Propagation/ARW; Collaborative Filtering (ALS, SGD, SVD); Topic Modeling (LDA); K-Means
      Mining: PageRank; Random Walk with Restart; Connected Components; Triangle Counting; K-Truss; Centrality Measures; Network Diameter; Degree Distribution
      Traversal (Search): Depth-First Search; Breadth-First Search
  • 12. Graphical Machine Learning • Need fully-integrated solutions that are easy to program • Scale like Hadoop; speed and accuracy of in-memory graph analytics and mining • Enables applications in broadband services, network security, retail, life sciences, financial markets, etc. [Figure: pipeline Input Data → Construct Graph → Build Model → Serve Model → Insight & Prediction; other labels: HDFS, DB, Web, Docs, Intel Graph Builder on, Graph Query Processing & Storage.]
  • 13. Graph Processing: Technology Challenges
      Traction:
      Performance – Has skyrocketed with in-memory and asynchronous graph engines and scalable graph query architectures
      Algorithms – A wide range of toolkits with graph mining and graphical machine learning algorithms, with more sophistication and scaled versions arriving “every day”
      Data Models – Most large-scale work still on homogeneous graphs but property graphs and meta-path concepts are more widely discussed
      Not so much Progress!
      Programming – Challenging programming models in languages not popular with data scientists, IT developers, and other end-users
      Data Visualization – No great packages to visualize relationships du jour and interactive big data sampling and projection too crude & slow
      Data Preparation – Takes way too long, is way too manual, and is fraught with error
      Integration – Multiple frameworks are difficult to synchronize, coordinate, and manage
      Intel Labs continues to work on the gaps.
  • 14. Pig ETL for Graphs? Nothing specific for graph ETL. What’s needed: • support for well-known input-output graph formats • graph specific filters & transforms • STORE functions for graph stores Original Vision
  • 15. Graph Builder 2 Alpha • Construction of heterogeneous information networks with Pig • Better “progressive refinement” during acquisition, cleaning, and integration • Incremental graph construction • Interfacing for popular graph databases (Titan, RDF output, etc.) [Figure*: a Product Graph, Ratings Graph, and Social Graph joined into one network; vertices Bicycles, Food Cart, Frank, Ted, Mohit, Ivy, Kushal, Nezih, Danny; edge labels likes, uses, friends, brothers; inferred: “Ted may like bicycle-powered food cart”.] * Inspired by “Titan: Rise of Big Graph Data,” by M. Rodriguez and M. Broecheler
  • 16. Example Stack Architecture. [Figure: stack diagram with labels Raw Data, RDBMS, HDFS, NFS, Graph ETL, Pig, Graph Builder, Real-time Graph Queries, Gremlin, Rexster, Blueprints, Titan, HBase, Graph Analytics, Giraph, ML, Mahout, ZooKeeper, Hadoop, HDFS, Feature Store, Model Store.]
  • 17. Graph ETL Example
      Extract: Archive Records
        http://www.1stvwparts.com/default.php?cPath=159 74.86.123.84 20091120145711 text/html 28628
        HTTP/1.1 200 OK
        <table border="0" width="100%" cellspacing="0" cellpadding="0"> <tr> <td width="100%" class="infoBoxHeading_search">Quick Find</td> </tr></table><table border="0" width="100%" cellspacing="0" cellpadding="0" class="infoBox_search"> <tr> <td><table border="0" width="100%" cellspacing="0" cellpadding="3" ...
      Transform: Parse HTML, look for links and words
        { "url": "http://www.1stvwparts.com/default.php?cPath=159", "timestamp": "20091120145711", "links": ["http://www.1stvwparts.com/shopping_cart.php", "http://www.partsfirm.com", ...], "words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "html", "php"] }
      PageRank and Latent Dirichlet Allocation
      Load: Graph Builder to Titan
        row src dst #links
        67033:-20071306431384422339653 http://www.kog.com http://www.dlstainedglass.com 2
        91658:-20071306431384422339653 http://www.kog.com http://www.haegerstainedglass.com 2
        941:-19442631361384422339653 http://www.ks-p.jp http://www.drag-race.nuhuh.bee.pl 1
        44116:-18273037921384422339653 http://www.kune.fr http://www.chezfanny.fr 3
        36891:-18273037921384422339653 http://www.kune.fr http://www.wp-jobboard.kune.fr 3
        79906:-17817899301384422339654 http://www.kwc.edu http://www.umsl-sports.com 1
        2238:-17817799001384422339654 http://www.kwc.org http://www.onlamp.com 1
        68133:-17817799001384422339654 http://www.kwc.org http://www.tjhsst.edu 1
        30677:-17817799001384422339654 http://www.kwc.org http://www.floydlandis.com 1
        81185:-17817799001384422339654 http://www.kwc.org http://www.you-are-here.com 1
        47527:-17817799001384422339654 http://www.kwc.org http://www.phonak-cycling.ch 1
        63112:-17817799001384422339654 http://www.kwc.org http://www.link.brightcove.com 1
        74837:-17817799001384422339654 http://www.kwc.org http://www.trustbut.blogspot.com 6
        53668:-17817799001384422339654 http://www.kwc.org http://www.icanhascheezburger.com 4
        97945:-17817799001384422339654 http://www.kwc.org http://www.mythbustersfanclub.com 12
        93849:-17709983361384422339654 http://www.kwmd.us http://www.sierraclub.typepad.com 1
        51421:-17700453681384422339654 http://www.kwne.jp http://www.ppvj.co.jp 1
        13022:-17651665521384422339654 http://www.kwu.edu http://www.rollinghillszoo.com 2
        16530:-17113867601384422339654 http://www.kyou.nu http://www.fan.unfading-scar.net 2
        14199:-16755866041384422339654 http://www.kzy.com http://www.wbbm780.com 1
        95253:-16755866041384422339654 http://www.kzy.com http://www.brewview.com 1
        25828:-14077538951384422339655 http://www.lee.org http://www.kaiju.com 1
        88133:-14077538951384422339655 http://www.lee.org http://www.sfgov.org 2
        94243:-14077538951384422339655 http://www.lee.org http://www.liftport.com 1
        56826:-14077538951384422339655 http://www.lee.org http://www.nishioka.com 1
        88574:-14077538951384422339655 http://www.lee.org http://www.Smartflix.com 1
        81966:-14077538951384422339655 http://www.lee.org http://www.smartflix.com 145
        83164:-14077538951384422339655 http://www.lee.org http://www.torrentspy.com 1
        99087:-14077538951384422339655 http://www.lee.org http://www.SerpentMother.com 1
        39124:-14077538951384422339655 http://www.lee.org http://www.serpentmother.com 3
        95995:-14077538951384422339655 http://www.lee.org http://www.toolbar.google.com 2
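The Transform step on slide 17 ("Parse HTML, look for links and words") can be illustrated with a minimal Python sketch. This is not the talk's Pig/UDF code: the class `LinkAndWordExtractor`, the function `transform`, and the sample anchor tag are hypothetical names and data chosen for illustration, and only the standard-library `html.parser` module is assumed. Text inside `<script>` and `<style>` is skipped so that only visible text contributes words.

```python
import re
from html.parser import HTMLParser


class LinkAndWordExtractor(HTMLParser):
    """Collects href targets of <a> tags and lowercase words from visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = set()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside <script>/<style> is treated as visible.
        if self._skip_depth == 0:
            self.words.update(re.findall(r"[a-z]+", data.lower()))


def transform(url, timestamp, html):
    """Build a record shaped like the JSON on the slide (hypothetical helper)."""
    parser = LinkAndWordExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "timestamp": timestamp,
        "links": parser.links,
        "words": sorted(parser.words),
    }


# The table cell comes from the slide; the anchor tag is made up for the example.
record = transform(
    "http://www.1stvwparts.com/default.php?cPath=159",
    "20091120145711",
    '<table><tr><td class="infoBoxHeading_search">Quick Find</td></tr></table>'
    '<a href="http://www.partsfirm.com">Parts Firm</a>',
)
print(record["links"], record["words"])
```

A real crawl record would also need the ARC header and HTTP response headers stripped before parsing; that step is left out here.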
  • 18. Development Pains with Pig As-Is
      Data Process Flow: Load with Pig → Turn into edge list (Pig, UDF) → Store to HDFS (Pig) → Load into Titan (GraphBuilder) → Run ML algorithms (Giraph) → Model queries (Gremlin)
      Development Flow (or, what actually happened): Extract with Python → Develop transforms → Test on a couple files → Fix bugs → Run Python in Jython (fail miserably) → Spend too much time enabling → Write UDF in Java → Find limitations → Develop custom load UDF instead → …
      All of this before any Machine Learning!
  • 19. Out-of-the-Box Tools: Custom UDFs add a lot of complexity, time and effort.
      If you don’t have this… (More of these please)
        X = FOREACH A GENERATE TOKENIZE(f1);
      You’re stuck with this…
        package org.apache.pig.builtin;
        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.BagFactory;
        import org.apache.pig.data.DataBag;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.data.TupleFactory;
        import org.apache.pig.impl.logicalLayer.schema.Schema;
        import org.apache.pig.data.DataType;
        public class TOKENIZE extends EvalFunc<DataBag> {
            TupleFactory mTupleFactory = TupleFactory.g
            BagFactory mBagFactory = BagFactory.getInst
            public DataBag exec(Tuple input) throws IOE
                try {
                    DataBag output = mB
                    Object o = input.ge
                    if (!(o instanceof
                        throw n
                    }
                    StringTokenizer tok
                    while (tok.hasMoreT
                    return output;
                } catch (ExecException ee) {
                    // error handling g
                }
            }
            public Schema outputSchema(Schema input) {
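To show how little logic sits behind the one-line `TOKENIZE(f1)` call compared with the Java boilerplate, here is a minimal Python sketch of the same idea. It is an illustration rather than Pig's implementation: the function name `tokenize` is hypothetical, and the separator set (space, double quote, comma, parentheses, star) is an assumption based on the separators Pig documents for `TOKENIZE`.

```python
import re

# Assumed separator set: space, double quote, comma, parentheses, star.
_SEPARATORS = re.compile(r'[ ",()*]+')


def tokenize(text):
    """Split a string into a list of words, dropping empty tokens."""
    if text is None:
        return []
    return [token for token in _SEPARATORS.split(text) if token]


tokens = tokenize('John likes (red, "ripe") apples*')
print(tokens)
```

In Pig the result would be a bag of single-field tuples rather than a Python list; the sketch only mirrors the splitting behavior.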
  • 20. Breadth of Knowledge. [Figure: the pipeline steps Load Raw Data, Extract Links, Filter Bad Data, Group Like Links Together, Store - HBase, Store into Titan (Graph Builder), set against the technologies each one requires: Java, Pig, MapReduce.]
  • 21. Even if you have ninja skills, you’ll still need to deal with weirdness.
  • 22. Random Record { "url": "http://www.1stvwparts.com/default.php?cPath=159", "timestamp": "20091120145711", "words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "gti", "cache", "returnsprivacy", "partsfirm", "rear", "scratch", "passat", "gmt", "plate", "handle", "scuff", "was", "hatch", "wagonnew", "got", "nov", "password", "advanced", "and", "aluminum", "controlbackup", "server", "specials", "quick", "x", "largo", "www", "splash", "items", "edition", "beetle", "home", "even", "cargo", "what", "mats", "for", "sportwagen", "please", "anniversary", "content", "lights", "fog", "new", "email", "monster", "ipod", "beetlepassatrabbitroutantiguantouareg", "quite", "cart", "acoustic", "apache", "november", "includes", "by", "care", "search", "ok", "of", "openssl", "shipping", "covereuropean", "s", "products", "stainless", "wheels", "protector", "com", "vinyl", "duty", "pre", "encoding", "cc", "sweet", "noticeconditions", "clear", "unix", "electronic", "post", "cabrio", "powered", "chrome", "transfer", "fri", "accent", "mkv", "cpath", "jetta", "maintenance", "type", "store", "more", "function", "shopping", "gear", "hub", "tiguan", "park", "pragma", "brushed", "must", "steel", "distance", "account", "look", "default", "bbs", "cap", "us", "guards", "reviews", "routan", "fun", "chunked", "my", "usecontact", "control", "they", "bumper", "warning", "brake", "eurovan", "exterior", "close", "rig", "dent", "check", "registerforgot", "information", "revalidate", "sensors", "connection", "member", "html", "touareg", "online", "interior", "european", "wheel", "http", "though", "models", "bestsellers", "expires", "categories", "kit", "catalog", "date", "php", "e", "center", "light", "thu", "no", "cover", "frontpage", "eos", "gorilla", "contact", "the", "selectcabriocceoseurovangligolfgtijettajetta", "left"] }
  • 23. Random Record (the same record as on slide 22): Uselessly common words. Common connector words can be trimmed …with a bunch more ETL.
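The "bunch more ETL" for trimming connector words might look like the following minimal Python sketch. The stop-word list, the `min_length` cutoff, and the function name `trim_common_words` are all assumptions made for illustration; the talk does not say which words were removed or how.

```python
# Hypothetical stop-word list, drawn from words that appear in the record on slide 22.
STOP_WORDS = frozenset({
    "and", "by", "for", "of", "the", "was", "what", "they",
    "my", "no", "us", "just", "even", "though", "more", "must",
})


def trim_common_words(words, stop_words=STOP_WORDS, min_length=2):
    """Drop stop words and tokens shorter than min_length, preserving order."""
    return [w for w in words if w not in stop_words and len(w) >= min_length]


sample = ["covermodel", "golf", "rabbit", "just", "was", "and", "x", "for", "beetle", "the"]
trimmed = trim_common_words(sample)
print(trimmed)
```

A fixed list like this is the simplest option; a frequency-based cutoff computed over the whole corpus is a common alternative when the vocabulary is not known in advance.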
  • 24. Random Record (the same record as on slide 22): Words mangled together? Is there an edge case that’s causing this?
  • 25. Random Record (the same record as on slide 22): Were these actually visible? “html” was found in every record, something seems wrong.
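A word that shows up in every record, as “html” did, can be caught with a simple document-frequency check. The sketch below is a hedged illustration, not the talk's method: the function names, the `threshold` parameter, and the three toy records are hypothetical, and records are assumed to be dicts with a "words" list as on slide 22.

```python
from collections import Counter


def document_frequency(records):
    """Count, for each word, how many records contain it at least once."""
    counts = Counter()
    for record in records:
        counts.update(set(record["words"]))
    return counts


def suspicious_words(records, threshold=1.0):
    """Return words present in at least `threshold` (a fraction) of all records."""
    total = len(records)
    if total == 0:
        return []
    frequencies = document_frequency(records)
    return sorted(w for w, c in frequencies.items() if c / total >= threshold)


# Toy records, made up for the example.
records = [
    {"words": ["html", "golf"]},
    {"words": ["html", "php", "jetta"]},
    {"words": ["html", "php"]},
]
print(suspicious_words(records))
print(suspicious_words(records, threshold=0.6))
```

Words flagged this way are candidates for inspection, for instance tokens leaking in from HTTP headers or markup rather than from visible page text.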
  • 26.
      -- Load raw data
      raw_data = LOAD '/zach/common-crawl/1285409360731_9.arc.gz' USING ArcLoader() AS (header:chararray, html:chararray);
      -- Extract links
      edge_list = FOREACH raw_data GENERATE ExtractLinks(*);
      -- Filter & Normalize
      edge_list_filtered = FILTER edge_list BY FilterAny(*);
      src_based = FOREACH edge_list_filtered GENERATE NormalizeURL(*, 0);
      src_based_cleaned = FILTER src_based BY FilterMalformedURL(*, 1);
      dest_based = FOREACH src_based_cleaned GENERATE NormalizeURL(*, 1);
      dest_based_self_loops_removed = FILTER dest_based BY FilterLoop(*);
      final = FILTER dest_based_self_loops_removed BY NOT (src_domain MATCHES '.*mailto.*' OR dest_domain MATCHES '.*mailto.*');
      -- Generate Link Counts
      grouped = GROUP final BY (src_domain,dest_domain) PARALLEL 64;
      with_link_count = FOREACH grouped GENERATE group.src_domain, group.dest_domain, COUNT(final) AS num_links:long;
      -- Assign HBase Keys
      with_hbase_keys = FOREACH with_link_count GENERATE RowKeyAssignerUDF(*);
      final_graph = FOREACH with_hbase_keys GENERATE FLATTEN($0) AS (key:chararray, src_domain:chararray, dest_domain:chararray, num_links:long);
      -- Store into Titan
      STORE_GRAPH(final_graph, 'hbase://pagerank_edge_list', 'Titan');
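The filter, normalize, and count stages of the script on slide 26 can be mirrored in a few lines of single-machine Python, which may help when checking the Pig output on a small sample. This is a sketch under assumptions: `to_domain` and `link_counts` are hypothetical stand-ins for the talk's `NormalizeURL`, `FilterMalformedURL`, and `FilterLoop` UDFs, whose exact behavior is not shown. Because `urlsplit(...).hostname` is lowercased, pairs such as `http://www.Smartflix.com` and `http://www.smartflix.com`, which appear as separate rows in the edge list on slide 17, would be merged here.

```python
from collections import Counter
from urllib.parse import urlsplit


def to_domain(url):
    """Return the lowercase host of an http(s) URL, or None if it is unusable."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None  # also rejects mailto: links and bare strings
    return parts.hostname


def link_counts(pairs):
    """Count links per (src_domain, dest_domain), skipping bad URLs and self-loops."""
    counts = Counter()
    for src, dst in pairs:
        src_domain, dest_domain = to_domain(src), to_domain(dst)
        if src_domain is None or dest_domain is None:
            continue
        if src_domain == dest_domain:
            continue
        counts[(src_domain, dest_domain)] += 1
    return counts


# Sample pairs, made up for the example.
pairs = [
    ("http://www.lee.org/a", "http://www.Smartflix.com/x"),
    ("http://www.lee.org/b", "http://www.smartflix.com/y"),
    ("http://www.lee.org/", "http://www.lee.org/about"),
    ("http://www.lee.org/", "mailto:someone@example.com"),
    ("not a url", "http://www.kaiju.com"),
]
counts = link_counts(pairs)
print(counts)
```

The Pig version distributes the same grouping across reducers with `PARALLEL 64`; the in-memory `Counter` here only suits samples that fit on one machine.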
  • 27. Demo.
  • 28. Open Problems with Pig ETL (for Data Science)
  • 29. (1) Complex JSON/XML processing is painful. [Figure: Pig architecture diagram with labels User Interface, Interactive Mode, Batch Mode, Embedded Mode (Java, Python, etc.), Pig Scripting Interface, LOAD Functions, Built in Functions and Operators, Data Type Support, STORE Functions, Parser, Planner, Backend & Execution Engines, MR Jobs, UDFs, Open source packages; callout 1.]
      { "Top-Level-Field": "top_level", "Inner-Json": [{ "Name": "inner-name", "Value": 10 }]}
      json_data = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
      unnested = FOREACH json_data GENERATE $0#'Top-Level-Field' AS (top_level_field_value: chararray), FLATTEN($0#'Inner-Json') AS (inner_json: map[]);
      unnested = FOREACH unnested GENERATE top_level_field_value, FLATTEN(inner_json#'Name') AS (inner_name: chararray), FLATTEN(inner_json#'Value') AS (inner_value:long);
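For comparison, the same unnesting of the sample document takes a few lines in plain Python. This is only a single-machine sketch to make the intended output shape concrete; the function name `flatten` is hypothetical, and it is not a replacement for the Elephant Bird loader used in the Pig script.

```python
import json


def flatten(record):
    """Emit one (top_level_field, inner_name, inner_value) row per inner object."""
    top = record["Top-Level-Field"]
    return [
        (top, inner["Name"], inner["Value"])
        for inner in record.get("Inner-Json", [])
    ]


doc = json.loads(
    '{"Top-Level-Field": "top_level", '
    '"Inner-Json": [{"Name": "inner-name", "Value": 10}]}'
)
rows = flatten(doc)
print(rows)
```

Each row corresponds to one tuple of `(top_level_field_value, inner_name, inner_value)` in the final `unnested` relation.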
  • 30. (2) Better high-level language integration • Native-like experience with non-JVM languages (Python, R, etc.) • REST interface can be improved (HCATALOG-182) [Figure: same Pig architecture diagram as slide 29; callout 2.]
  • 31. (3) Better data exploration & error reporting • Faster iterative processing (Spark, YARN) • Better SAMPLE (WIP: PIG-1713) • SUMMARY for descriptive statistics • More descriptive error messages [Figure: same Pig architecture diagram as slide 29; callout 3.]
  • 32. (4) Better control with HBaseStorage • Inefficient for bulk loading • Better HBase filter support • Batching support • Fetch multiple versions [Figure: same Pig architecture diagram as slide 29; callout 4.]
  • 33. Questions? • Graph Builder 2 Alpha Dec’13 • Apache 2.0 OS code available at: www.01.org/graphbuilder/
  • 34. Legal Notices: INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2013 Intel Corporation.
  • 35. Abstract: Intel is working hard to build datacenter software from the silicon up that provides for a wide range of advanced analytics on Apache Hadoop. The Graph Analytics Operation within Intel Labs is helping to transform Hadoop into a full-blown “knowledge discovery platform” that can deftly process a wide range of data models, from simple tables to multi-property graphs, using sophisticated machine learning algorithms and data mining techniques. But the analysis cannot start until features are engineered, a task that takes a lot of time and effort today. In this talk, I will describe some of the Hadoop-based tools we are developing to make it easier for data scientists to deal with data quality issues and construct features for scalable machine learning, including graph-based approaches.