2. Machine Learning may nourish the soul…
[Image source: Wikipedia (Banquet)]
…but Data Preparation will consume it.
[Image source: Wikipedia (Hell)]
3. Machine Learning on Large Datasets
Data Quality and Feature Engineering
• Figure out what's there
• Extract a bunch of features
• Figure out what's needed
• Finalize and feed
[Diagram: input data passes through Extract, Transform, Load ("Argghh!") to become feature data, which is split into training, validation, and test sets; a model is built, validated, and applied to new data to deliver value. Model building is supervised learning; feature extraction spans supervised and unsupervised learning.]
4. Problems with Processing Large Datasets
Not turn-key
Are data scientists really expected to know…
• how to set up Hadoop from scratch?
• Java, Pig, and the Hadoop APIs?
• how to extend with UDFs?
• how to extract, analyze, and visualize output beyond Hadoop?
“After hours of debugging our Hadoop setup, I was ecstatic to run a
Hadoop command without a java stack trace.”
- Zach
5. Problems with Processing Large Datasets
Not agile
                        Traditional Environment    Distributed Environment
Command response        < 1 sec                    > 30 sec
Dependency inclusion    Simple                     Several steps and changes
Validation              Established methods        Not clear
Development cycle       Fast iteration             Slow or linear
6. Apache Pig
• A dataflow processing system for MapReduce
• A high-level scripting language -- Pig Latin
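A minimal Pig Latin dataflow makes both points concrete; this is a sketch of the classic word count, with the file name and schema chosen for illustration:

```pig
-- Load raw text lines (file name and schema are illustrative)
lines = LOAD 'lines.txt' AS (line:chararray);
-- Split each line into a bag of word tuples, then flatten to one word per row
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'word_counts';
```

Each statement names a dataflow step; Pig compiles the whole script into a plan of MapReduce jobs.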
7. Why Pig for ETL?
• Easy to get up & running
• Easy to program – simple declarative scripting
language, built-in dataflow primitives
• Nested data model support
• First class extensibility – custom filters,
transforms, input/output formats, etc.
• Automatic dataflow optimization – Pig-to-MapReduce runtime
ratio: ~0.97x as of Pig 0.12
• As configurable as MR
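On configurability: Pig's SET statement passes properties straight through to the underlying jobs, so MapReduce-level tuning stays available from the script. A small sketch (the queue name is illustrative):

```pig
-- Default reducer parallelism for the whole script
SET default_parallel 20;
-- Arbitrary Hadoop job properties pass through to MapReduce
SET mapreduce.job.queuename 'etl';
```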
8. The story gets even better
• Elephant Bird – good support for different formats,
codecs, etc.
• DataFu – Pig UDFs for data mining & statistics
• PiggyBank – collection of additional UDFs
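As a sketch of how these libraries plug in (the jar file name and input data are illustrative; StreamingMedian is from DataFu's documented datafu.pig.stats package):

```pig
REGISTER datafu-1.2.0.jar;
-- Approximate median from DataFu; StreamingMedian does not require sorted input
DEFINE Median datafu.pig.stats.StreamingMedian();
scores = LOAD 'scores.tsv' AS (user:chararray, score:double);
grouped = GROUP scores BY user;
-- One approximate median score per user
medians = FOREACH grouped GENERATE group AS user, Median(scores.score);
```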
9. So, we’re done, right?
No. Many open challenges,
including complex models.
12. Graphical Machine Learning
• Need fully-integrated solutions that are easy to program
• Scale like Hadoop; speed and accuracy of in-memory graph analytics and mining
• Enables applications in broadband services, network security, retail, life sciences,
financial markets, etc.
[Diagram: input data from databases, the web, and documents flows into Intel Graph Builder on Hadoop/HDFS to construct a graph; graph query, processing, and storage support building and serving models that deliver insight and prediction]
13. Graph Processing: Technology Challenges
Progress!
• Performance – has skyrocketed with in-memory and asynchronous graph engines and scalable graph query architectures
• Algorithms – a wide range of toolkits with graph mining and graphical machine learning algorithms, with more sophisticated and scaled versions arriving "every day"
Traction
• Data Models – most large-scale work is still on homogeneous graphs, but property graphs and meta-path concepts are more widely discussed
Not so much
• Programming – challenging programming models in languages not popular with data scientists, IT developers, and other end users
• Data Visualization – no great packages to visualize the relationships du jour; interactive big-data sampling and projection are too crude and slow
• Data Preparation – takes way too long, is way too manual, and is fraught with error
• Integration – multiple frameworks are difficult to synchronize, coordinate, and manage
Intel Labs continues to work on the gaps.
14. Pig ETL for Graphs?
Nothing specific for graph ETL yet. What's needed:
• support for well-known graph input/output formats
• graph-specific filters & transforms
• STORE functions for graph stores
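A hypothetical sketch of what such graph-aware Pig ETL could look like; EdgeListStorage and all field names here are invented for illustration (no such built-in exists today):

```pig
-- Turn raw click logs into a weighted edge list (all names illustrative)
clicks = LOAD 'clicks.tsv' AS (user:chararray, item:chararray);
pairs = GROUP clicks BY (user, item);
edges = FOREACH pairs GENERATE FLATTEN(group) AS (src, dst), COUNT(clicks) AS weight;
-- A graph-specific STORE function like this is the missing piece
STORE edges INTO 'graph' USING EdgeListStorage();
```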
15. Graph Builder 2 Alpha
• Construction of heterogeneous information networks with Pig
• Better "progressive refinement" during acquisition, cleaning, and integration
• Incremental graph construction
• Interfacing for popular graph databases (Titan, RDF output, etc.)
[Diagram: a social graph ("friends" edges among Frank, Ted, Mohit, Ivy, Kushal, Nezih, and Danny, with Nezih and Danny as brothers) combined with a ratings graph ("likes" edges) and a product graph (bicycles; a food cart that "uses" bicycles), supporting inferences such as "Ted may like a bicycle-powered food cart"]
* Inspired by, “Titan: Rise of Big Graph Data,” by M. Rodriguez and M. Broecheler
16. Example Stack Architecture
[Diagram: raw data from RDBMS, HDFS, and NFS passes through graph ETL (Pig + Graph Builder) into feature and model stores; graph analytics and ML run on Giraph and Mahout; real-time graph queries run on Titan over HBase/HDFS via Blueprints, Gremlin, and Rexster; ZooKeeper and Hadoop coordinate the stack]
18. Development Pains with Pig As-Is
Data Process Flow:
• Load with Pig
• Turn into edge list (Pig, UDF)
• Store to HDFS (Pig)
• Load into Titan (GraphBuilder)
• Run ML algorithms (Giraph)
• Model queries (Gremlin)
Development Flow (or, what actually happened):
• Extract with Python
• Develop transforms
• Test on a couple of files
• Fix bugs
• Run Python in Jython (fail miserably)
• Spend too much time enabling
• Write UDF in Java
• Find limitations
• Develop custom load UDF instead
• …
All of this before any Machine Learning!
19. Out-of-the-Box Tools
Custom UDFs add a lot of complexity, time and effort.
If you don’t have this….
X = FOREACH A GENERATE
TOKENIZE(f1);
(More of these please)
You’re stuck with this…
package org.apache.pig.builtin;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray, but got "
                        + o.getClass().getName());
            }
            StringTokenizer tok = new StringTokenizer((String) o, " \",()*", false);
            while (tok.hasMoreTokens()) {
                output.add(mTupleFactory.newTuple(tok.nextToken()));
            }
            return output;
        } catch (ExecException ee) {
            // error handling goes here
        }
        return null;
    }

    public Schema outputSchema(Schema input) { /* … */ }
}
20. Breadth of Knowledge
[Diagram: each pipeline step — load raw data, extract links, filter bad data, group like links together, store to HBase, store into Titan (Graph Builder) — requires some mix of Pig, Java, and MapReduce expertise]
21. Even if you have ninja skills, you’ll
still need to deal with weirdness.
29. Complex JSON/XML processing is painful
[Pig architecture diagram, area 1: user interface (interactive, batch, and embedded modes), scripting interface, built-in functions and operators, parser, planner, LOAD/STORE functions, UDFs, and MR-job backends; LOAD functions highlighted]
{ "Top-Level-Field": "top_level", "Inner-Json": [{ "Name": "inner-name", "Value": 10 }]}

json_data = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
unnested = FOREACH json_data GENERATE
    $0#'Top-Level-Field' AS (top_level_field_value: chararray),
    FLATTEN($0#'Inner-Json') AS (inner_json: map[]);
unnested = FOREACH unnested GENERATE
    top_level_field_value,
    FLATTEN(inner_json#'Name') AS (inner_name: chararray),
    FLATTEN(inner_json#'Value') AS (inner_value: long);
30. [Pig architecture diagram, area 2: the embedded-mode interface (Java, Python, etc.) highlighted]
Better high-level language integration
Native-like experience with non-JVM languages (Python, R, etc.)
REST interface can be improved (HCATALOG-182)
31. [Pig architecture diagram, area 3: the interactive mode and execution engines highlighted]
Better data exploration & error reporting
Faster iterative processing (Spark, YARN)
Better SAMPLE (WIP: PIG-1713)
SUMMARY for descriptive statistics
More descriptive error messages
32. [Pig architecture diagram, area 4: STORE functions highlighted]
Better control with HBaseStorage
• Inefficient for bulk loading
• Better HBase filter support
• Batching support
• Fetch multiple versions
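For reference, a typical HBaseStorage round trip today looks roughly like this (table and column names are illustrative); the wish list above is about giving this interface more control:

```pig
-- Load two columns from an HBase table, keeping the row key as the first field
users = LOAD 'hbase://users'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:name info:age', '-loadKey true')
    AS (id:bytearray, name:chararray, age:long);
-- Store back; the first field is the row key, the rest map to the listed columns
STORE users INTO 'hbase://users_copy'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:age');
```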
35. Abstract
Intel is working hard to build datacenter software from the silicon up that
provides for a wide range of advanced analytics on Apache Hadoop. The
Graph Analytics Operation within Intel Labs is helping to transform Hadoop
into a full-blown “knowledge discovery platform” that can deftly process a
wide range of data models, from simple tables to multi-property graphs,
using sophisticated machine learning algorithms and data mining
techniques. But, the analysis cannot start until features are engineered, a
task that takes a lot of time and effort today. In this talk, I will describe
some of the Hadoop-based tools we are developing to make it easier for
data scientists to deal with data quality issues and construct features for
scalable machine learning, including graph-based approaches.