What the #(&*$ is Big Data?
A Holistic View of Data and Algorithms
Alice Zheng, GraphLab
Strata Conference, Santa Clara
February, 2014
Background
• Machine Learning
  • Enable machines to understand the world
  • Play with data
• GraphLab
  • Unleash data science!
  • Enable non-ML experts to play with data
• This talk: a look at Big Data and Machine Learning from a tool builder’s perspective
Strata Conf, Feb 2014 2
DATA
What is Data?
• Data is an extension of ourselves
  • Pictures, texts, messages, logs
  • Sensors and devices
  • Measurements and experiments
• Data is organic; it is wild and messy
• Data proliferates
Producers of Big Data
• Tech industry
  • Google, Microsoft, Facebook, Amazon, Twitter, …
• Consumer/Retail
  • Walmart, Target, Amazon, Netflix, …
• Telecom
  • Verizon, AT&T, Telefonica, …
• Finance
  • Thomson Reuters, Dow Jones, …
• Health care and monitoring
  • Personal health metrics, health care records, …
• Science
  • Genome research, high energy physics, astronomy, NASA, …
• Etc.
Facebook
• 1.11 billion active users [March 2013]
• 665 million daily users on average [March 2013]
• Daily data amount: [Aug 2012]
  • 500+ TB data
  • 2.5 billion pieces of content
  • 2.7 billion “Like” actions
  • 300 million photos
  • Scans 105 TB data every ½ hour
• 100+ PB data stored on a single Hadoop cluster [Aug 2012]
Data Sources: [Yahoo! news] [TechCrunch]
System Event Logs
ETW (Event Tracing for Windows)
• Logs of kernel and application events
• Up to 100K events per second
• Binary log size: ~200 MB every 2-5 minutes
• 20-50 TB/year from one machine
• ~50 PB/year from 1000 machines
Data source: http://msdn.microsoft.com/en-us/library/windows/desktop/bb968803%28v=vs.85%29.aspx
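The yearly totals on this slide follow from the per-log size. A quick back-of-the-envelope check, assuming the machine logs continuously all year and using decimal units (1 TB = 1,000,000 MB); both assumptions are mine, not the slide's:

```python
MB_PER_LOG = 200                    # ~200 MB per binary log (from the slide)
MINUTES_PER_YEAR = 365 * 24 * 60    # assumes tracing never stops


def tb_per_year(minutes_per_log):
    """Yearly log volume of one machine in TB (1 TB = 1,000,000 MB)."""
    logs_per_year = MINUTES_PER_YEAR / minutes_per_log
    return logs_per_year * MB_PER_LOG / 1_000_000


low = tb_per_year(5)     # one log every 5 minutes
high = tb_per_year(2)    # one log every 2 minutes

# 1000 machines at N TB each is N PB in total, so the same numbers apply.
print(f"one machine:   {low:.0f} to {high:.0f} TB/year")
print(f"1000 machines: {low:.0f} to {high:.0f} PB/year")
```

Under these assumptions the range comes out at roughly 21 to 53 TB per machine per year, which matches the slide's 20-50 TB and puts the ~50 PB figure for 1000 machines at the upper end of the range.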
A Picture of Big Data
[Figure: bubble chart of datasets. Axes: Total Size / Year (GB, TB, PB, EB) and Structure. Size of bubble = size of a single record (log-scale). Categories: Science, Tech, Other. Datasets shown: Wikipedia, WebSpam, Sys Logs, Walmart, LHC, Whole Genome Scans, SDSS, Flickr, Cellphone CDRs, Facebook, Twitter]
TAKING THE LEAP
ALGORITHMS
The Way to Insight
• What do people do with Big Data?
• Myriad algorithms for myriad tasks
• Two disparate examples
  • What movies would Bob like? – discovering recommendations from a crowd
  • Why is my machine so slow? – diagnosing systems using event logs
Algorithm Example 1:
A Recommender System
What Movies Would Bob Like?
• Bob watched “Silver Linings Playbook” and “Twin Peaks.” What else might Bob like?
• Given movie selections of many users, make recommendations for individuals
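User-Movie Interaction Matrix
[Figure: matrix with one row per user (Bob, Anna, David, Ethan) and one column per movie (Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive); a mark in a cell means the user watched the movie]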
Finding Similar Movies
• Jaccard similarity between a pair of movies:
  Sim(A, B) = (num users who watched both A and B) / (num users who watched either A or B)
• If every user who watched either movie ends up watching both, then the two movies must be very similar.
Sim(“Silver Linings Playbook”, “Hunger Games”) = 1/3
Movie Similarity Matrix
                         Silver Linings  Hunger  Twin   Iron   Mulholland
                         Playbook        Games   Peaks  Man 3  Drive
Silver Linings Playbook  1               1/3     2/3    0      1/3
Hunger Games             1/3             1       1/4    0      1/3
Twin Peaks               2/3             1/4     1      0      2/3
Iron Man 3               0               0       0      1      0
Mulholland Drive         1/3             1/3     2/3    0      1
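The similarity matrix can be recomputed from watch sets with a few lines of Python. This is a sketch, not the talk's implementation: the marks in the interaction matrix did not survive in this transcript, so the `watched` sets below are a hypothetical assignment chosen to be consistent with the similarity values above and with Bob having watched "Silver Linings Playbook" and "Twin Peaks". Treating a movie as fully similar to itself, even with no viewers, is also an assumption.

```python
from fractions import Fraction

# Hypothetical watch sets (movie -> users who watched it). The original
# matrix entries were lost; this assignment is an assumption that happens
# to reproduce the similarity values shown on the slide.
watched = {
    "Silver Linings Playbook": {"Bob", "Anna"},
    "Hunger Games": {"Anna", "Ethan"},
    "Twin Peaks": {"Bob", "Anna", "David"},
    "Iron Man 3": set(),
    "Mulholland Drive": {"Anna", "David"},
}


def jaccard(a, b):
    """Jaccard similarity between the audiences of movies a and b."""
    if a == b:
        return Fraction(1)       # convention: a movie matches itself
    either = watched[a] | watched[b]
    if not either:
        return Fraction(0)       # nobody watched either movie
    both = watched[a] & watched[b]
    return Fraction(len(both), len(either))


movies = list(watched)
sim = {a: {b: jaccard(a, b) for b in movies} for a in movies}

for a in movies:
    print(f"{a:24s}", "  ".join(f"{str(sim[a][b]):>3s}" for b in movies))
```

Using `Fraction` keeps the values exact, so they can be compared directly against the 1/3, 2/3, and 1/4 entries in the table.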
Making New Recommendations
recs = []
for movie in user.preferences:
    new_movies = Sim[movie, :].topk()
    recs.append(new_movies)
recs.sort()
• Equivalently, take the vector-matrix product
  • vector = the user’s preferences
  • matrix = movie similarity matrix
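The vector-matrix formulation can be sketched in plain Python using the similarity matrix from the earlier slide. Bob's preference vector marks the two movies he watched. Dropping already-watched movies from the ranking is an extra step assumed here; the slide does not spell it out.

```python
from fractions import Fraction as F

movies = ["Silver Linings Playbook", "Hunger Games", "Twin Peaks",
          "Iron Man 3", "Mulholland Drive"]

# Movie similarity matrix, copied from the slide (rows and columns follow
# the order of `movies`).
sim = [
    [F(1),    F(1, 3), F(2, 3), F(0), F(1, 3)],
    [F(1, 3), F(1),    F(1, 4), F(0), F(1, 3)],
    [F(2, 3), F(1, 4), F(1),    F(0), F(2, 3)],
    [F(0),    F(0),    F(0),    F(1), F(0)],
    [F(1, 3), F(1, 3), F(2, 3), F(0), F(1)],
]

# Bob's preference vector: 1 for each movie he watched, 0 otherwise.
bob = [1, 0, 1, 0, 0]

# Vector-matrix product: scores[j] = sum over i of bob[i] * sim[i][j].
n = len(movies)
scores = [sum(bob[i] * sim[i][j] for i in range(n)) for j in range(n)]

# Rank the movies Bob has not seen yet, highest score first.
recs = sorted(((scores[j], movies[j]) for j in range(n) if not bob[j]),
              reverse=True)

for score, movie in recs:
    print(f"{movie}: {score}")
```

With these inputs "Mulholland Drive" scores 1/3 + 2/3 = 1 and ranks first, ahead of "Hunger Games" at 1/3 + 1/4 = 7/12.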
Key Ideas
• During training: compute item-item similarity matrix
• Making recommendations: take vector-matrix product
Algorithm Example 2:
Diagnosing a slow computer
Why is My Machine So Slow?
• Slow machines are frustrating!
• Diagnose slowness via event logs
ETW – Event Tracing for Windows
• Fine-grained event tracing
• Up to 100,000 events per second
[Figure: excerpt of a sample ETW log]
Diagnosing Slowness
• Start from slow thread
• Walk backwards to construct wait graph
[Figure: wait graph laid out along a time axis. Threads shown: Firefox, Network Stack, Search Indexer, Anti-Virus Checker. Waits are labeled TCP/IP packet, File Lock, File Lock]
Key Algorithm Ideas
• The insight is a wait graph
• Constructing the graph involves repeated queries into a large set of events
• Iterate:
  • What was the current thread waiting on?
  • Go to the source of the wait
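The iteration above can be sketched as a backwards walk over wait events. Everything concrete here is hypothetical: the record fields (`waiter`, `resource`, `holder`) are invented for illustration and are not the ETW schema, and the chain of waits is only one plausible reading of the figure. A real wait graph can also branch, which this single-chain sketch ignores.

```python
# Hypothetical, heavily simplified wait events. Each record says which
# thread was blocked, on what, and which thread held that resource.
events = [
    {"waiter": "Firefox", "resource": "TCP/IP packet",
     "holder": "Network Stack"},
    {"waiter": "Network Stack", "resource": "File Lock",
     "holder": "Search Indexer"},
    {"waiter": "Search Indexer", "resource": "File Lock",
     "holder": "Anti-Virus Checker"},
]


def build_wait_chain(events, slow_thread):
    """Walk backwards from the slow thread, following each wait to its source."""
    # Index events by waiting thread so each step is a lookup, not a scan.
    by_waiter = {event["waiter"]: event for event in events}
    chain = []
    seen = set()
    current = slow_thread
    while current in by_waiter and current not in seen:
        seen.add(current)                  # guard against cyclic waits
        event = by_waiter[current]         # what was this thread waiting on?
        chain.append((current, event["resource"], event["holder"]))
        current = event["holder"]          # go to the source of the wait
    return chain


for waiter, resource, holder in build_wait_chain(events, "Firefox"):
    print(f"{waiter} waited on {resource} held by {holder}")
```

The dictionary built inside `build_wait_chain` is the important design choice: without an index, every step of the walk would rescan the full event log.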
What links these algorithms and data?
DATA STRUCTURES – THE BRIDGE
Between Data and Algorithms
• Data structures
  • Organized data
  • Optimized for certain computations
  • The key to efficient analysis
• Algorithms prefer certain data structures
• Raw data is amenable to certain data structures
[Diagram: Data -> Data Structures <- Algorithms, with the links labeled “Amenable” and “Preference”]
The Disconnect
• Machine Learning research – largely disconnected from implementation
• Some recent advances in large-scale ML are rediscovering known data structures
• Next-gen ML tools need well-tailored data structures
[Diagram: Machine Learning (statistics, optimization, linear algebra, …) shown apart from Data Structures (lists, trees, tables, graphs, …)]
Two Useful Data Structures
• Flat tables
• Graphs
Data Structure 1: Flat Table
Flat Tables
• Rows and columns
• Rows = records
• Columns can be typed
• A lot of raw data looks like flat tables!
Example 1: User-Item interaction data
User     Item                     Rating  Time
Alice    Breaking Bad, Season 1   3       …
Charlie  Twilight                 2
Bob      Silver Linings Playbook  4
Frank    American Hustle          2
Tina     Plan 9 From Outer Space  4
Bob      Twin Peaks               2
Diana    Dr. Strangelove          5
…
Example 2: Event log data
Timestamp  Name          PID   CPU  Stack                                …
447590409  audiodg.exe   1848  1    ntkrnlpa.exe!KeSetEvent
                                    ntkrnlpa.exe!WaitForLock
447590411  csrss.exe     460   0    …
447590415  iexplore.exe  2478  1    kernel64.exe!WaitForMultipleObjects
…
Variations of Flat Tables
• Query vs. computation
• Random access (in-memory) vs. sequential access (on-disk)
• Column vs. row-wise representation
• Indexed or not
• Distributed or not
• Key-value stores (hash tables)
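To make the column vs. row-wise distinction concrete, here is a minimal sketch that holds the same three records both ways. The rows are taken from the user-item example; the variable names are illustrative only.

```python
# Row-wise: one tuple per record. Natural for appending new records and
# for reading a whole record at once.
rows = [("Alice", 3), ("Charlie", 2), ("Bob", 4)]

# Column-wise: one list per column. Natural for scanning or aggregating a
# single attribute without touching the others.
columns = {
    "User": [user for user, _ in rows],
    "Rating": [rating for _, rating in rows],
}

# A column aggregate only needs to read the "Rating" list.
mean_rating = sum(columns["Rating"]) / len(columns["Rating"])
print(mean_rating)
```

Which layout is preferable depends on the dominant access pattern, which is the point of the list above: the same logical table can be organized very differently depending on the computation it has to serve.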
Data Structure 1.5: Indexed Flat Table
Example of Indexed Flat Table
User     Item                     Rating
Alice    Breaking Bad, Season 1   3
Charlie  Twilight                 2
Bob      Silver Linings Playbook  4
Frank    American Hustle          2
Tina     Plan 9 From Outer Space  4
Bob      Twin Peaks               2
Diana    Dr. Strangelove          5
…
Index
Query: What items did Bob rate?
Index of “Bob” points to rows 3 and 6
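A minimal sketch of such an index, using a dictionary from user to row numbers. The rows are the ones shown in the table; the helper name `items_rated_by` is made up for illustration.

```python
from collections import defaultdict

# Rows of the flat table, in the order shown on the slide.
rows = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]

# Build the index once: user -> list of 1-based row numbers.
index = defaultdict(list)
for row_number, (user, _item, _rating) in enumerate(rows, start=1):
    index[user].append(row_number)


def items_rated_by(user):
    """Answer the query with index lookups instead of a full table scan."""
    return [rows[n - 1][1] for n in index.get(user, [])]


print(index["Bob"])             # row numbers for Bob
print(items_rated_by("Bob"))    # the items in those rows
```

Building the index costs one pass over the table; after that, each "what did this user rate?" query touches only the matching rows.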
Back to the Recommender
• Training: compute a matrix
• Recommending: vector-matrix product
• Raw data: user-item interaction log
  • Load in as flat table
  • Build index (user-item matrix)
  • Iterate through the users to train
ML on Flat Tables
• Anything where data is represented as feature vectors
• Computations operate on rows
  • Stochastic gradient descent
  • K-means clustering
• … or columns
  • Decision tree family
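As a small illustration of a row-wise computation, the sketch below fits a one-parameter linear model with per-row gradient updates. The data, learning rate, and epoch count are all made up for the example. Rows are visited in a fixed order so the run is reproducible, whereas a true stochastic gradient descent would visit them in random order.

```python
# Hypothetical flat table: each row is (feature, target), target = 2 * feature.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w = 0.0                 # single model weight for the model y = w * x
learning_rate = 0.01

for epoch in range(200):
    for x, y in rows:                       # one update per row
        error = w * x - y
        # Gradient of 0.5 * error**2 with respect to w is error * x.
        w -= learning_rate * error * x

print(round(w, 6))
```

Each update needs only one row at a time, which is why this family of algorithms fits tables that are read sequentially.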
Data Structure 2: Graph
Example
[Figure: example graph with vertices Anna, Diana, Charlie, Frank, Tina, Bob, Sam]
Implementation 1: Edge List
• A simple flat table!
• Additional columns = edge attributes (e.g., user rating of movie, time watched, etc.)
User     Item
Alice    Breaking Bad, Season 1
Charlie  Twilight
Bob      Silver Linings Playbook
Frank    American Hustle
Tina     Plan 9 From Outer Space
Bob      Twin Peaks
Diana    Dr. Strangelove
…
Implementation 2: Edge List + Vertex List
• Two flat tables
• Pre-computed join on VertexID

Vertex table:
VertexID  Name                     Age  Genre
1         Alice                    50
2         Charlie                  26
3         Bob                      33
…
100001    Silver Linings Playbook       Romance
100002    Iron Man 3                    Action
100003    Twin Peaks                    Thriller

Edge table:
SrcVertex  DstVertex
1          389944
2          136782
3          100001
4          572639
5          200835
3          100003
…
Graph Operations
• get_neighbors():
  1. Query indexed flat table
  2. Join with vertex table on VertexID or Name

User  Movie                    Rating
Bob   Silver Linings Playbook  4
Bob   Twin Peaks               2

VertexID  Name                     Age  Genre
3         Bob                      33
100001    Silver Linings Playbook       Romance
100003    Twin Peaks                    Thriller
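The two steps can be sketched with a pair of small in-memory tables. This is an illustration only, not GraphLab's API: the function name mirrors the slide, and only the two edges whose endpoints both appear in the vertex table on the slide are included.

```python
# Vertex table keyed by VertexID (values from the slide; blank cells are None).
vertices = {
    1: {"Name": "Alice", "Age": 50, "Genre": None},
    2: {"Name": "Charlie", "Age": 26, "Genre": None},
    3: {"Name": "Bob", "Age": 33, "Genre": None},
    100001: {"Name": "Silver Linings Playbook", "Age": None, "Genre": "Romance"},
    100002: {"Name": "Iron Man 3", "Age": None, "Genre": "Action"},
    100003: {"Name": "Twin Peaks", "Age": None, "Genre": "Thriller"},
}

# Edge table as (SrcVertex, DstVertex) pairs.
edges = [(3, 100001), (3, 100003)]

# Index the edge table by source vertex (built once, reused per query).
out_index = {}
for src, dst in edges:
    out_index.setdefault(src, []).append(dst)


def get_neighbors(vertex_id):
    """Step 1: query the edge index. Step 2: join with the vertex table."""
    return [vertices[dst] for dst in out_index.get(vertex_id, [])]


for neighbor in get_neighbors(3):
    print(neighbor["Name"], neighbor["Genre"])
```

Keeping vertex attributes in their own table avoids repeating them on every edge, at the cost of one join per neighbor lookup.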
Graph Operations
• get_subgraph():
  • get_neighbors(), instantiate new table with subset of rows of old tables
• Find edges/vertices with attribute = x
  • Filter old tables
• Hypergraph – edges span more than 2 vertices
  • Just add more columns to the edge table
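Both operations reduce to filtering rows of the edge table. A minimal sketch, using the user-item rows from the earlier example; the helper names `filter_edges` and `get_subgraph` are illustrative and not taken from any library.

```python
# Edge table as a list of records (rows and ratings copied from the slides).
edges = [
    {"User": "Alice", "Item": "Breaking Bad, Season 1", "Rating": 3},
    {"User": "Charlie", "Item": "Twilight", "Rating": 2},
    {"User": "Bob", "Item": "Silver Linings Playbook", "Rating": 4},
    {"User": "Frank", "Item": "American Hustle", "Rating": 2},
    {"User": "Tina", "Item": "Plan 9 From Outer Space", "Rating": 4},
    {"User": "Bob", "Item": "Twin Peaks", "Rating": 2},
    {"User": "Diana", "Item": "Dr. Strangelove", "Rating": 5},
]


def filter_edges(table, **conditions):
    """Find edges with attribute = x by filtering the old table."""
    return [row for row in table
            if all(row.get(col) == val for col, val in conditions.items())]


def get_subgraph(table, user):
    """Neighborhood subgraph: a new table holding a subset of the old rows."""
    return filter_edges(table, User=user)


bob_subgraph = get_subgraph(edges, "Bob")
low_rated = filter_edges(edges, Rating=2)
print(bob_subgraph)
print(low_rated)
```

Both functions return new lists and leave the original table untouched, which mirrors the "instantiate new table" wording on the slide.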
Back to Syslog Mining
• Wait graph construction = search and filter
• Iterate:
  • get_neighbors()
  • filter on edge and vertex attribute to find culprits
• Sequential process
• Underlying event graph is enormous
• SLOW
ML on Graphs
• Graphical models (Bayes nets)
  • Belief propagation
  • Gibbs sampling
• Random walk on Markov chains
  • PageRank
• Some algorithms can be implemented on either tables or graphs
  • Matrix factorization
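As an example of a graph computation, here is a minimal power-iteration sketch of PageRank over an adjacency list. The four-vertex graph, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration; production implementations typically add convergence checks and distribute the work.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency-list graph.

    out_links maps every vertex to the list of vertices it links to; every
    link target must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                # Each vertex passes its rank along its out-edges.
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Hypothetical four-vertex graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for vertex in sorted(ranks, key=ranks.get, reverse=True):
    print(vertex, round(ranks[vertex], 3))
```

Each iteration only needs every vertex's current rank and its out-neighbors, so the inner loop is the get_neighbors() access pattern repeated over the whole graph.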
Graphs vs. Tables
• Closely related
  • Graphs can be implemented on top of tables
• … yet different
  • What key operations to optimize
  • How much to pre-compute
    • Indexes
    • Joins
    • Filters
Popular Implementations
Flat Tables
[Figure: tools placed on two axes, Random Access (In Memory) vs. Sequential Access (On Disk) and Querying (Interactive) vs. Computation (Batch). Tools shown: Pandas, Spark, SQL, Hive/Pig, GraphLab SFrame]
Graphs
[Figure: same two axes. Tools shown: GraphLab Graph, GraphChi Graph, GraphDBs (HyperGraphDB, Titan, Neo4j), Giraph]
Conclusions
• Fast and scalable analysis hinges upon efficient data structures
• Match the algo to the data structure
• Morph raw data into the data structure
[Diagram: Raw Data -> Data Structure -> Algorithm -> Insight]
Advertising
• GraphLab Tutorial this afternoon!
• “Large Scale Machine Learning Cookbook Using GraphLab”
• Ballroom G, 1:30pm to 5pm

More Related Content

What's hot

Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaMagellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Spark Summit
 
Automated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskAutomated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with Dask
ASI Data Science
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Databricks
 

What's hot (20)

Machine Learning with Azure
Machine Learning with AzureMachine Learning with Azure
Machine Learning with Azure
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Networks are like onions: Practical Deep Learning with TensorFlow
Networks are like onions: Practical Deep Learning with TensorFlowNetworks are like onions: Practical Deep Learning with TensorFlow
Networks are like onions: Practical Deep Learning with TensorFlow
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
 
Scalable data structures for data science
Scalable data structures for data scienceScalable data structures for data science
Scalable data structures for data science
 
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaMagellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MD
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
 
New Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemNew Capabilities in the PyData Ecosystem
New Capabilities in the PyData Ecosystem
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
Automated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskAutomated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with Dask
 
Spark
SparkSpark
Spark
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
 Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ... Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ...
 

Viewers also liked

An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
PyData
 
Interpreting charts and graphs
Interpreting charts and graphsInterpreting charts and graphs
Interpreting charts and graphs
lesliejohnson441
 

Viewers also liked (7)

The How and Why of Feature Engineering
The How and Why of Feature EngineeringThe How and Why of Feature Engineering
The How and Why of Feature Engineering
 
Understanding Feature Space in Machine Learning
Understanding Feature Space in Machine LearningUnderstanding Feature Space in Machine Learning
Understanding Feature Space in Machine Learning
 
Feature engineering for diverse data types
Feature engineering for diverse data typesFeature engineering for diverse data types
Feature engineering for diverse data types
 
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
 
Interpreting charts and graphs
Interpreting charts and graphsInterpreting charts and graphs
Interpreting charts and graphs
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 

Similar to What the Bleep is Big Data? A Holistic View of Data and Algorithms

How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
Databricks
 

Similar to What the Bleep is Big Data? A Holistic View of Data and Algorithms (20)

Data Matters for AGU Early Career Conference
Data Matters for AGU Early Career ConferenceData Matters for AGU Early Career Conference
Data Matters for AGU Early Career Conference
 
Bren - UCSB - Spooky spreadsheets
Bren - UCSB - Spooky spreadsheetsBren - UCSB - Spooky spreadsheets
Bren - UCSB - Spooky spreadsheets
 
Preventing data loss
Preventing data lossPreventing data loss
Preventing data loss
 
Data Stewardship for SPATIAL/IsoCamp 2014
Data Stewardship for SPATIAL/IsoCamp 2014Data Stewardship for SPATIAL/IsoCamp 2014
Data Stewardship for SPATIAL/IsoCamp 2014
 
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
 
Presentation1.pdf
Presentation1.pdfPresentation1.pdf
Presentation1.pdf
 
Immersive Recommendation
Immersive RecommendationImmersive Recommendation
Immersive Recommendation
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Responsible conduct of research: Data Management
Responsible conduct of research: Data ManagementResponsible conduct of research: Data Management
Responsible conduct of research: Data Management
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
DataHub
DataHubDataHub
DataHub
 
Infusing Social Data Analytics into Future Internet applications for Manufact...
Infusing Social Data Analytics into Future Internet applications for Manufact...Infusing Social Data Analytics into Future Internet applications for Manufact...
Infusing Social Data Analytics into Future Internet applications for Manufact...
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
WebServices_Grid.ppt
WebServices_Grid.pptWebServices_Grid.ppt
WebServices_Grid.ppt
 
Tdwg14 fp-kurator-ludaescher
Tdwg14 fp-kurator-ludaescherTdwg14 fp-kurator-ludaescher
Tdwg14 fp-kurator-ludaescher
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantPOWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
 
How to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data TeamHow to Feed a Data Hungry Organization – by Traveloka Data Team
How to Feed a Data Hungry Organization – by Traveloka Data Team
 

Recently uploaded

DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
MayuraD1
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 

Recently uploaded (20)

Introduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfIntroduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdf
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
fitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .pptfitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .ppt
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...
Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...
Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
 
Computer Graphics Introduction To Curves
Computer Graphics Introduction To CurvesComputer Graphics Introduction To Curves
Computer Graphics Introduction To Curves
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 

What the Bleep is Big Data? A Holistic View of Data and Algorithms

  • 1. What the #(&*$ is Big Data? A Holistic View of Data and Algorithms Alice Zheng, GraphLab Strata Conference, Santa Clara February, 2014
  • 2. Background • Machine Learning • Enable machines to understand the world • Play with data • GraphLab • Unleash data science! • Enable non-ML experts to play with data • This talk: a look at Big Data and Machine Learning from a tool builder’s perspective Strata Conf, Feb 2014 2
  • 4. What is Data? • Data is an extension of ourselves • Pictures, texts, messages, logs • Sensors and devices • Measurements and experiments • Data is organic; it is wild and messy • Data proliferates Strata Conf, Feb 2014 4
  • 5. Producers of Big Data • Tech industry • Google, Microsoft, Facebook, Amazon, Twitter, … • Consumer/Retail • Walmart, Target, Amazon, Netflix, … • Telecomm • Verizon, AT&T, Telefonica, … • Finance • Thomson Reuters, Dow Jones, … • Health care and monitoring • Personal health metrics, health care records, … • Science • Genome research, high energy physics, astronomy, NASA, … • Etc. Strata Conf, Feb 2014 5
  • 6. • 1.11 billion active users [March 2013] • 665 million daily users on average [March 2013] • Daily data amount: [Aug 2012] • 500+ TB data • 2.5 billion pieces of content • 2.7 billion “Like” actions • 300 mil photos • Scans 105 TB data every ½ hour • 100+ PB data stored on a single Hadoop cluster [Aug 2012] Strata Conf, Feb 2014 6 Data Sources: [Yahoo! news] [TechCrunch]
  • 7. System Event Logs ETW (Event Tracing for Windows) • Logs of kernel and application events • Up to 100K events per second • Binary log size: ~200 MB every 2-5 minutes • 20-50 TB/year from one machine • ~50 PB/year from 1000 machines Strata Conf, Feb 2014 7 Data source: http://msdn.microsoft.com/en-us/library/windows/desktop/bb968803%28v=vs.85%29.aspx
  • 8. A Picture of Big Data Strata Conf, Feb 2014 8 WikipediaWebSpam Sys Logs Walmart LHC Whole Genome Scans SDSS Flickr Cellphone CDRs Facebook Twitter GB TB PB EB Total Size / Year Structure Science Tech Size of bubble = Size of a single record (log-scale) Other
  • 9. TAKING THE LEAP Strata Conf, Feb 2014 9
  • 11. The Way to Insight • What do people do with Big Data? • Myriad algorithms for myriad tasks • Two disparate examples • What movies would Bob like? – discovering recommendations from a crowd • Why is my machine so slow? – diagnosing systems using event logs Strata Conf, Feb 2014 11
  • 12. Algorithm Example 1: A Recommender System Strata Conf, Feb 2014
  • 13. What Movies Would Bob Like? • Bob watched “Silver Linings Playbook” and “Twin Peaks.” What else might Bob like? • Given movie selections of many users, make recommendations for individuals Strata Conf, Feb 2014
  • 14. User-Movie Interaction Matrix Silver Linings Playbook Hunger Games Twin Peaks Iron Man 3 Mulholland Drive Bob Anna David Ethan Strata Conf, Feb 2014
  • 15. Finding Similar Movies • Jaccard similarity between a pair of movies num users who watched both num users who watched either • If every user who watched one or the other movie, ends up watching both, then the two movies must be very similar. Strata Conf, Feb 2014
  • 16. User-Movie Interaction Matrix Silver Linings Playbook Hunger Games Twin Peaks Iron Man 3 Mulholland Drive Bob Anna David Ethan Strata Conf, Feb 2014 Sim(“Silver Linings Playbook”, “Hunger Games”) = ?
  • 19. User-Movie Interaction Matrix • Columns (movies): Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive • Rows (users): Bob, Anna, David, Ethan • Sim(“Silver Linings Playbook”, “Hunger Games”) = 1/3
  • 20. Movie Similarity Matrix
      (movie) | Silver Linings Playbook | Hunger Games | Twin Peaks | Iron Man 3 | Mulholland Drive
      Silver Linings Playbook | 1 | 1/3 | 2/3 | 0 | 1/3
      Hunger Games | 1/3 | 1 | 1/4 | 0 | 1/3
      Twin Peaks | 2/3 | 1/4 | 1 | 0 | 2/3
      Iron Man 3 | 0 | 0 | 0 | 1 | 0
      Mulholland Drive | 1/3 | 1/3 | 2/3 | 0 | 1
  • 21. Making New Recommendations
      recs = []
      for movie in user.preferences:
          new_movies = Sim[movie, :].topk()
          recs.extend(new_movies)
      recs.sort()
    • Equivalently, take the vector-matrix product • vector = the user’s preferences • matrix = movie similarity matrix
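The vector-matrix formulation can be sketched in plain Python using the similarity values from the Movie Similarity Matrix slide. Two choices below are assumptions rather than anything the slides specify: the preference vector is binary (1 = watched), and movies the user has already watched are excluded from the ranking. The function name `recommend` and the parameter `k` are illustrative.

```python
from fractions import Fraction as F

movies = ["Silver Linings Playbook", "Hunger Games", "Twin Peaks",
          "Iron Man 3", "Mulholland Drive"]

# Item-item similarity matrix, in the same row/column order as `movies`.
sim = [
    [1,       F(1, 3), F(2, 3), 0, F(1, 3)],
    [F(1, 3), 1,       F(1, 4), 0, F(1, 3)],
    [F(2, 3), F(1, 4), 1,       0, F(2, 3)],
    [0,       0,       0,       1, 0],
    [F(1, 3), F(1, 3), F(2, 3), 0, 1],
]


def recommend(watched, k=2):
    """Score every movie as (preference vector) x (similarity matrix), return top k."""
    prefs = [1 if m in watched else 0 for m in movies]
    n = len(movies)
    # Vector-matrix product: scores[j] = sum over i of prefs[i] * sim[i][j]
    scores = [sum(prefs[i] * sim[i][j] for i in range(n)) for j in range(n)]
    unseen = [j for j in range(n) if movies[j] not in watched]
    unseen.sort(key=lambda j: scores[j], reverse=True)
    return [(movies[j], scores[j]) for j in unseen[:k]]


bob_recs = recommend({"Silver Linings Playbook", "Twin Peaks"})
print(bob_recs)
```

For Bob, who watched “Silver Linings Playbook” and “Twin Peaks”, this ranks “Mulholland Drive” first (score 1/3 + 2/3 = 1) and “Hunger Games” second (score 1/3 + 1/4 = 7/12).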
  • 22. Key Ideas • During training: compute item-item similarity matrix • Making recommendations: take vector-matrix product
  • 23. Algorithm Example 2: Diagnosing a slow computer
  • 24. Why is My Machine So Slow? • Slow machines are frustrating! • Diagnose slowness via event logs
  • 25. ETW – Event Tracing for Windows • Fine-grained event tracing • Up to 100,000 events per second • [Figure: excerpt of sample ETW log]
  • 26. Diagnosing Slowness • Start from slow thread • Walk backwards to construct wait graph • [Figure: wait graph along a time axis; labels: Firefox, Network Stack, TCP/IP packet, Search Indexer, File Lock, Anti-Virus Checker, File Lock]
  • 27. Key Algorithm Ideas • The insight is a wait graph • Constructing the graph involves repeated queries into a large set of events • Iterate: • What was the current thread waiting on? • Go to the source of the wait
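The iterate-and-walk-backwards idea can be sketched as follows. This is a heavily simplified model, not the ETW format or the speaker's implementation: the `WaitEvent` record, its fields, and the sample events are all hypothetical, and each thread is assumed to wait on at most one holder at a time.

```python
from collections import defaultdict, namedtuple

# Hypothetical record: at `time`, `waiter` blocked on `resource` held by `holder`.
WaitEvent = namedtuple("WaitEvent", ["time", "waiter", "holder", "resource"])


def build_wait_chain(events, slow_thread, at_time):
    """Walk backwards from the slow thread, following who was waiting on whom."""
    by_waiter = defaultdict(list)          # index: waiter -> its wait events
    for ev in events:
        by_waiter[ev.waiter].append(ev)
    for evs in by_waiter.values():
        evs.sort(key=lambda ev: ev.time)

    chain, current, t, seen = [], slow_thread, at_time, {slow_thread}
    while True:
        # Query: what was the current thread waiting on at or before time t?
        earlier = [ev for ev in by_waiter.get(current, []) if ev.time <= t]
        if not earlier:
            break                          # reached the source of the wait
        ev = earlier[-1]
        chain.append(ev)
        if ev.holder in seen:              # guard against cycles
            break
        seen.add(ev.holder)
        current, t = ev.holder, ev.time    # go to the source of the wait
    return chain


events = [
    WaitEvent(100, "search_indexer", "antivirus", "file lock"),
    WaitEvent(120, "network_stack", "search_indexer", "file lock"),
    WaitEvent(130, "browser", "network_stack", "TCP/IP packet"),
    WaitEvent(300, "search_indexer", "backup", "file lock"),  # later, unrelated wait
]
chain = build_wait_chain(events, "browser", at_time=150)
for ev in chain:
    print(f"{ev.waiter} waited on {ev.holder} ({ev.resource})")
```

Each loop iteration is one query into the event set, which is why the choice of index over the events matters so much once the log holds billions of records.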
  • 28. What links these algorithms and data?
  • 29. DATA STRUCTURES – THE BRIDGE
  • 30. Between Data and Algorithms • Data structures • Organized data • Optimized for certain computations • The key to efficient analysis • Algorithms prefer certain data structures • Raw data is amenable to certain data structures • [Diagram: Data Structures sit between Data (link labeled “Amenable”) and Algorithms (link labeled “Preference”)]
  • 31. The Disconnect • Machine Learning research is largely disconnected from implementation • Some recent advances in large-scale ML are rediscovering known data structures • Next-gen ML tools need well-tailored data structures • [Diagram: Machine Learning (statistics, optimization, linear algebra, …) and Data Structures (lists, trees, tables, graphs, …)]
  • 32. Two Useful Data Structures • Flat tables • Graphs
  • 33. Data Structure 1: Flat Table
  • 34. Flat Tables • Rows and columns • Rows = records • Columns can be typed • A lot of raw data looks like flat tables!
  • 35. Example 1: User-Item interaction data
      User | Item | Rating | Time
      Alice | Breaking Bad, Season 1 | 3 | …
      Charlie | Twilight | 2
      Bob | Silver Linings Playbook | 4
      Frank | American Hustle | 2
      Tina | Plan 9 From Outer Space | 4
      Bob | Twin Peaks | 2
      Diana | Dr. Strangelove | 5
      …
  • 36. Example 2: Event log data
      Timestamp | Name | PID | CPU | Stack | …
      447590409 | audiodg.exe | 1848 | 1 | ntkrnlpa.exe!KeSetEvent ntkrnlpa.exe!WaitForLock
      447590411 | csrss.exe | 460 | 0 | …
      447590415 | iexplore.exe | 2478 | 1 | kernel64.exe!WaitForMultipleObjects …
  • 37. Variations of Flat Tables • Query vs. computation • Random access (in-memory) vs. sequential access (on-disk) • Column vs. row-wise representation • Indexed or not • Distributed or not • Key-value stores (hash tables)
  • 38. Data Structure 1.5: Indexed Flat Table
  • 41. Example of Indexed Flat Table
      User | Item | Rating
      Alice | Breaking Bad, Season 1 | 3
      Charlie | Twilight | 2
      Bob | Silver Linings Playbook | 4
      Frank | American Hustle | 2
      Tina | Plan 9 From Outer Space | 4
      Bob | Twin Peaks | 2
      Diana | Dr. Strangelove | 5
      …
    • Index query: What items did Bob rate? • Index of “Bob” points to rows 3 and 6
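One simple way to realize such an index is a hash map from a column value to the row numbers where it occurs. The sketch below uses the rows from the slide; the function name `build_index` and the 1-based row numbering (chosen to match “rows 3 and 6”) are assumptions made for illustration.

```python
from collections import defaultdict

# Rows of the flat table from the slide: (User, Item, Rating).
rows = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]


def build_index(table, column=0):
    """Map each value in `column` to the 1-based row numbers where it appears."""
    index = defaultdict(list)
    for row_number, row in enumerate(table, start=1):
        index[row[column]].append(row_number)
    return index


user_index = build_index(rows)
bob_rows = user_index["Bob"]                        # [3, 6]
bob_items = [rows[n - 1][1] for n in bob_rows]      # follow the index into the table
print(bob_rows, bob_items)
```

Building the index costs one pass over the table; after that, “What items did Bob rate?” is a hash lookup plus two row reads instead of a full scan.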
  • 42. Back to the Recommender • Training: compute a matrix • Recommending: vector-matrix product • Raw data: user-item interaction log • Load in as flat table • Build index (user-item matrix) • Iterate through the users to train
  • 43. ML on Flat Tables • Anything where data is represented as feature vectors • Computations operate on rows • Stochastic gradient descent • K-means clustering • … or columns • Decision tree family
  • 44. Data Structure 2: Graph
  • 45. Example • [Figure: example graph with vertices Anna, Diana, Charlie, Frank, Tina, Bob, Sam]
  • 46. Implementation 1: Edge List • A simple flat table! • Additional columns = edge attributes (e.g., user rating of movie, time watched, etc.)
      User | Item
      Alice | Breaking Bad, Season 1
      Charlie | Twilight
      Bob | Silver Linings Playbook
      Frank | American Hustle
      Tina | Plan 9 From Outer Space
      Bob | Twin Peaks
      Diana | Dr. Strangelove
      …
  • 47. Implementation 2: Edge List + Vertex List • Two flat tables • Pre-computed join on VertexID
      Vertex table:
      VertexID | Name | Age | Genre
      1 | Alice | 50 |
      2 | Charlie | 26 |
      3 | Bob | 33 |
      …
      100001 | Silver Linings Playbook | | Romance
      100002 | Iron Man 3 | | Action
      100003 | Twin Peaks | | Thriller
      Edge table:
      SrcVertex | DstVertex
      1 | 389944
      2 | 136782
      3 | 100001
      4 | 572639
      5 | 200835
      3 | 100003
      …
  • 48. Graph Operations • get_neighbors(): 1. Query indexed flat table
  • 50. Graph Operations • get_neighbors(): 1. Query indexed flat table 2. Join with vertex table on VertexID or Name
      Query result:
      User | Movie | Rating
      Bob | Silver Linings Playbook | 4
      Bob | Twin Peaks | 2
      Vertex table rows:
      VertexID | Name | Age | Genre
      3 | Bob | 33 |
      100001 | Silver Linings Playbook | | Romance
      100003 | Twin Peaks | | Thriller
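The two steps of get_neighbors() can be sketched over plain Python tables. This is an illustration of the idea, not GraphLab's API: the helper `build_src_index` and the signature of `get_neighbors` are hypothetical, and only the edge rows whose endpoints appear in the vertex table excerpt are included.

```python
from collections import defaultdict

# Vertex table keyed by VertexID (values taken from the slides' excerpt).
vertex_table = {
    3: {"Name": "Bob", "Age": 33},
    100001: {"Name": "Silver Linings Playbook", "Genre": "Romance"},
    100002: {"Name": "Iron Man 3", "Genre": "Action"},
    100003: {"Name": "Twin Peaks", "Genre": "Thriller"},
}

# Edge table rows: (SrcVertex, DstVertex).
edge_table = [(3, 100001), (3, 100003)]


def build_src_index(edges):
    """Index the edge table on SrcVertex: vertex id -> positions of its edge rows."""
    index = defaultdict(list)
    for row, (src, _dst) in enumerate(edges):
        index[src].append(row)
    return index


def get_neighbors(vertex_id, edges, src_index, vertices):
    # Step 1: query the indexed edge table for the vertex's outgoing edges.
    neighbor_ids = [edges[row][1] for row in src_index.get(vertex_id, [])]
    # Step 2: join with the vertex table on VertexID to attach attributes.
    return [dict(vertices[vid], VertexID=vid) for vid in neighbor_ids]


src_index = build_src_index(edge_table)
neighbors = get_neighbors(3, edge_table, src_index, vertex_table)
print([n["Name"] for n in neighbors])
```

Keeping the vertex table keyed by VertexID makes the join a dictionary lookup per neighbor, which is one way to read the earlier slide's “pre-computed join”.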
  • 51. Graph Operations • get_subgraph(): • get_neighbors(), instantiate new table with subset of rows of old tables • Find edges/vertices with attribute = x • Filter old tables • Hypergraph – edges span more than 2 vertices • Just add more columns to the edge table
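Extracting a subgraph by attribute amounts to filtering both tables. A minimal sketch under stated assumptions: the function signature is hypothetical, the edge from Alice to “Twin Peaks” is invented for the example, and edges are kept only when both endpoints survive the vertex filter.

```python
# Vertex table keyed by VertexID; edge table as (SrcVertex, DstVertex) rows.
vertex_table = {
    1: {"Name": "Alice", "Age": 50},
    3: {"Name": "Bob", "Age": 33},
    100001: {"Name": "Silver Linings Playbook", "Genre": "Romance"},
    100003: {"Name": "Twin Peaks", "Genre": "Thriller"},
}
edge_table = [(3, 100001), (3, 100003), (1, 100003)]  # last edge is hypothetical


def get_subgraph(vertices, edges, keep_vertex):
    """Filter the vertex table, then keep edges whose endpoints both survive."""
    sub_vertices = {vid: attrs for vid, attrs in vertices.items()
                    if keep_vertex(vid, attrs)}
    sub_edges = [(src, dst) for src, dst in edges
                 if src in sub_vertices and dst in sub_vertices]
    return sub_vertices, sub_edges


def is_user_or_thriller(vid, attrs):
    """Keep all users (rows without a Genre) and movies with Genre = Thriller."""
    return "Genre" not in attrs or attrs["Genre"] == "Thriller"


sub_vertices, sub_edges = get_subgraph(vertex_table, edge_table, is_user_or_thriller)
print(sorted(sub_vertices), sub_edges)
```

The result is again a pair of flat tables, so every table operation (index, join, filter) still applies to the subgraph.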
  • 52. Back to Syslog Mining • Wait graph construction = search and filter • Iterate: • get_neighbors() • filter on edge and vertex attribute to find culprits • Sequential process • Underlying event graph is enormous • SLOW
  • 53. ML on Graphs • Graphical models (Bayes nets) • Belief propagation • Gibbs sampling • Random walk on Markov chains • PageRank • Some algorithms are implementable on either tables or graphs • Matrix factorization
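As one concrete instance of a random-walk algorithm on a graph, here is a minimal PageRank sketch that iterates directly over an edge list. It is a textbook power iteration, not the GraphLab implementation; the toy graph, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({v for edge in edges for v in edge})
    n = len(nodes)
    out_degree = {v: 0 for v in nodes}
    for src, _dst in edges:
        out_degree[src] += 1

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread evenly over all vertices.
        dangling = sum(rank[v] for v in nodes if out_degree[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        # Each edge passes an equal share of its source's rank to its destination.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical toy graph.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(edges)
for vertex, score in sorted(ranks.items()):
    print(vertex, round(score, 3))
```

The inner loop is a single sequential pass over the edge table, which illustrates why an algorithm like this can run on either a table or a graph representation.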
  • 54. Graphs vs. Tables • [Figure labels: Tables, Graphs]
  • 55. Graphs vs. Tables • Closely related • Graphs can be implemented on top of tables • … yet different • What key operations to optimize • How much to pre-compute • Indexes • Joins • Filters
  • 57. Flat Tables • [Figure: tools placed along two axes, Random Access (In Memory) vs. Sequential Access (On Disk) and Querying (Interactive) vs. Computation (Batch); tools shown: Pandas, Spark, SQL, Hive/Pig, GraphLab SFrame]
  • 58. Graphs • [Figure: tools placed along two axes, Random Access (In-Memory) vs. Sequential Access (On Disk) and Querying (Interactive) vs. Computation (Batch); tools shown: GraphLab Graph, GraphChi Graph, GraphDBs (HyperGraphDB, Titan, Neo4j), Giraph]
  • 59. Conclusions • Fast and scalable analysis hinges upon efficient data structures • Match the algorithm to the data structure • Morph raw data into the data structure • [Diagram: Raw Data → Data Structure → Algorithm → Insight]
  • 60. Advertising • GraphLab Tutorial this afternoon! • “Large Scale Machine Learning Cookbook Using GraphLab” • Ballroom G, 1:30pm to 5pm

Editor's Notes

  1. In order to understand the problems involved in Big Data Analysis, we have to switch from a learning- or modeling-centric view to the data-centric view.