Hadoop for Data Science
Donald Miner
NYC Pig User Group
August 22, 2013
About Don
@donaldpminer
dminer@clearedgeit.com
I’ll talk about…
Intro to Hadoop
Some reasons why I think Hadoop is cool
(is this cliché yet?)
Step 1: Hadoop
Step 2: ????
Step 3: Data Science!
Some examples of data science work on Hadoop
What can Hadoop do to enable data science work?
Hadoop
• Distributed platform for thousands of nodes
• Data storage and computation framework
• Open source
• Runs on commodity hardware
Hadoop Distributed File System
HDFS
• Stores files in folders (that’s it)
– Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicas of each block (better safe than sorry)
• Blocks are scattered all over the place
FILE BLOCKS
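The block and replica arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `hdfs_footprint` is made up, and the 64 MB default is simply the low end of the range quoted above (the real block size is a cluster setting).

```python
import math

MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    if file_size_bytes == 0:
        return 0, 0
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    # The last block only occupies as much space as it holds, so the
    # raw footprint is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes

blocks, raw = hdfs_footprint(1000 * MB)
print(blocks, raw // MB)  # 16 blocks, 3000 MB of raw storage
```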
MapReduce
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers
Mappers (you code this)
– Loads data from HDFS
– Filter, transform, parse
– Outputs (key, value) pairs
Reducers (you code this, too)
– Automatically groups by the mapper’s output key
– Aggregate, count, statistics
– Outputs to HDFS
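To make the mapper/reducer split concrete, here is a single-process simulation in plain Python. It is not Hadoop code: `run_job` stands in for the framework's shuffle, and the log format and function names are assumptions made up for the example.

```python
from collections import defaultdict

def mapper(line):
    """Parse one raw line and emit (key, value) pairs."""
    parts = line.split()
    if len(parts) >= 2:            # filter out malformed lines
        level = parts[1].upper()   # transform
        yield level, 1

def reducer(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)

def run_job(lines):
    # Shuffle: the framework groups mapper output by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

log_lines = [
    "10:01 error disk full",
    "10:02 info started",
    "10:03 ERROR timeout",
    "garbage",
]
print(run_job(log_lines))  # {'ERROR': 2, 'INFO': 1}
```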
Hadoop Ecosystem
• Higher-level languages like Pig and Hive
• HDFS Data systems like HBase and Accumulo
• Close friends like ZooKeeper, Flume, Storm,
Cassandra, Avro
Pig
• Pig is a fantastic query language that runs MapReduce
jobs
• Higher-level than MapReduce: write code in terms of
GROUP BY, DISTINCT, FOREACH, FILTER, etc.
• Custom loaders and storage functions make this good
glue
• I use this a lot
A = LOAD 'data.txt'
    AS (name:chararray, age:int, state:chararray);
B = GROUP A BY state;
C = FOREACH B GENERATE group, COUNT(A), AVG(A.age);
DUMP C;
Mahout
• Mahout is a Machine Learning
Library
• Has both parallel and
non-parallel
implementations of a
number of algorithms:
– Recommenders
– Clustering
– Classification
Cool Thing #1: Linear Scalability
• HDFS and MapReduce
scale linearly
• If you have twice as
many computers, jobs
run twice as fast
• If you have twice as
much data, jobs run
twice as slow
• If you have twice as
many computers, you
can store twice as much
data
DATA LOCALITY!!
Cool Thing #2: Schema on Read
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
What implications does this have?
BEFORE:
ETL, schema design upfront,
tossing out original data,
comprehensive data study
WITH HADOOP:
Keep original data around!
Have multiple views of the same data!
Work with unstructured data sooner!
Store first, figure out what to do with it later!
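A small sketch of what "multiple views of the same data" can mean in practice: the raw lines are stored untouched, and each reader applies its own schema at load time. The comma-separated layout and both view functions are hypothetical, chosen only for illustration.

```python
# Raw lines as they might sit in HDFS: no schema is stored with them.
RAW_LINES = [
    "2013-08-22,alice,NY,42",
    "2013-08-22,bob,MD,17",
]

def view_as_events(line):
    """One view: date and user; everything else is ignored."""
    date, user, _state, _value = line.split(",")
    return {"date": date, "user": user}

def view_as_measurements(line):
    """Another view of the same bytes: state plus a numeric value."""
    _date, _user, state, value = line.split(",")
    return {"state": state, "value": int(value)}

events = [view_as_events(line) for line in RAW_LINES]
measurements = [view_as_measurements(line) for line in RAW_LINES]
print(events[0], measurements[0])
```

Because the original lines are never rewritten, a third view can be added later without reloading anything.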
Cool Thing #3: Transparent Parallelism
Network programming?
Inter-process communication?
Threading?
Distributed stuff?
Fault tolerance?
Code deployment?
RPC?
Message passing?
Locking?
Data storage?
Scalability?
Data center fires?
With MapReduce, I DON’T CARE
… I just have to fit my solution into this tiny box
(Diagram: “Your solution” as a tiny box inside the MapReduce Framework)
Cool Thing #4: Unstructured Data
• Unstructured data:
media, text, forms, log data,
lumped structured data
• Query languages like SQL
and Pig assume some sort
of “structure”
• MapReduce is just Java:
You can do anything Java can
do in a Mapper or Reducer
One of the things Hadoop can do for you is turn your unstructured data into structured data
The rest of the talk
• Four threads:
– Data exploration
– Classification
– NLP
– Recommender systems
I’m using these to illustrate some points
Exploration
• Hadoop is great at exploring data!
• I like to explore data in a couple ways:
– Filtering
– Sampling
– Summarization
– Evaluate cleanliness
• I like to spend 50% of my time
doing exploration
(but unfortunately it’s the
first thing to get cut)
Filtering
• Filtering is like a microscope:
I want to take a closer look at a subset
• In MapReduce, you do this in the mapper
• Identify nasty records you want to get rid of
• Examples:
– Only New York data
– Only millennials
– Remove gibberish
– Only a 5-minute window
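A filter like this is a map-only job: each record is either emitted or dropped. A minimal Python sketch follows; the record layout `(name, age, state)`, the age range used for "millennials", and the gibberish test are all assumptions for illustration.

```python
def keep(record):
    """Return True for records that should pass the filter."""
    name, age, state = record
    if not name or not name.isalpha():         # drop gibberish names
        return False
    return state == "NY" and 18 <= age <= 33   # hypothetical age range

def filter_mapper(records):
    # Map-only job: emit matching records, drop everything else.
    for record in records:
        if keep(record):
            yield record

records = [
    ("alice", 25, "NY"),
    ("bob", 40, "NY"),
    ("carol", 30, "MD"),
    ("x#@!", 22, "NY"),
    ("dave", 19, "NY"),
]
print(list(filter_mapper(records)))
```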
Sampling
• Hadoop isn’t the king of interactive analysis
• Sampling is a good way to grab a set of data
then work with it locally (Excel?)
• Pig has a handy SAMPLE keyword
• Types of sampling:
– Sample randomly across the entire data set
– Sub-graph extraction
– Filters (from the last slide)
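Random sampling across the whole data set can be sketched as keeping each record independently with a fixed probability, which is a natural fit for a mapper because no record needs to know about any other. The function below is plain Python written for illustration; it is not Pig's implementation of SAMPLE.

```python
import random

def sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record

data = range(100_000)
subset = list(sample(data, 0.01, seed=42))
print(len(subset))  # roughly 1,000; the exact count depends on the seed
```

The result is small enough to pull down and work with locally.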
Summarization
• Summarization is a bird’s-eye view
• MapReduce is good at summarization:
– Mappers extract the group-by keys
– Reducers do the aggregation
• I like to:
– Count number, get stdev, get average, get min/max of
records in several groups
– Count nulls in columns
(if applicable)
– Grab top-10 lists
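The per-group statistics above can all be computed from a handful of running totals, which is what makes them easy to aggregate in a reducer. A single-process Python sketch, with made-up names and the simple sum-of-squares formula for the population standard deviation (fine for illustration, though not the most numerically robust choice):

```python
import math
from collections import defaultdict

def summarize(pairs):
    """pairs: iterable of (group_key, numeric_value)."""
    # Per-group accumulator: [count, sum, sum of squares, min, max].
    acc = defaultdict(lambda: [0, 0.0, 0.0, math.inf, -math.inf])
    for key, x in pairs:
        a = acc[key]
        a[0] += 1
        a[1] += x
        a[2] += x * x
        a[3] = min(a[3], x)
        a[4] = max(a[4], x)
    out = {}
    for key, (n, s, ss, lo, hi) in acc.items():
        mean = s / n
        variance = max(ss / n - mean * mean, 0.0)  # population variance
        out[key] = {"count": n, "mean": mean, "stdev": math.sqrt(variance),
                    "min": lo, "max": hi}
    return out

def top_n(summary, n=10):
    """Groups with the most records, largest first."""
    return sorted(summary, key=lambda k: summary[k]["count"], reverse=True)[:n]

stats = summarize([("NY", 2), ("NY", 4), ("MD", 10), ("NY", 6)])
print(stats["NY"], top_n(stats))
```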
Evaluating Cleanliness
• I’ve never been burned twice:
– There is a list of things that I like to check
• Things to check for:
– Fields that shouldn’t be null that are
– Duplicates (does the number of unique records equal the number of records?)
– Dates (look for 1970; look at formats; time zones)
– Things that should be normalized
– Keys that are different because of trash
e.g. " abc " != "abc"
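A checklist like this is easy to turn into a counting job. The sketch below runs over a list of dicts in plain Python; the field names (`id`, `date`, `key`) and the report labels are hypothetical, and a real job would emit these counts from mappers and sum them in a reducer.

```python
from collections import Counter

def cleanliness_report(records, required=("id", "date")):
    """records: list of dicts. Returns simple data-quality counts."""
    report = Counter()
    seen = set()
    for rec in records:
        for field in required:
            if rec.get(field) in (None, ""):
                report["null_" + field] += 1      # required field is empty
        row = tuple(sorted(rec.items()))
        if row in seen:
            report["duplicates"] += 1             # exact repeat of a record
        seen.add(row)
        if str(rec.get("date", "")).startswith("1970"):
            report["epoch_dates"] += 1            # likely a zero timestamp
        key = rec.get("key")
        if isinstance(key, str) and key != key.strip():
            report["untrimmed_keys"] += 1         # " abc " vs "abc"
    return dict(report)

rows = [
    {"id": 1, "date": "2013-08-22", "key": "abc"},
    {"id": 1, "date": "2013-08-22", "key": "abc"},
    {"id": 2, "date": "1970-01-01", "key": " abc "},
    {"id": None, "date": "", "key": "abd"},
]
print(cleanliness_report(rows))
```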
What’s the point?
• Hadoop is really good at this stuff!
• You probably have a lot of data and a lot of it
is garbage!
• Take the time to do this and your further work
will be much easier
• It’s hard to tell what methods
you should use until you
explore your data
Classification
• Classification is taking feature vectors (derived from
your data), and then guessing some sort of label
– E.g.,
sunny, Saturday, summer -> play tennis
rainy, Wednesday, winter -> don’t play tennis
• Most classification algorithms aren’t easily
parallelizable or don’t have good implementations
• You need a training set of true feature vectors and
labels… how often is your data labeled?
• I’ve found classification rather hard, except for when…
Overall Classification Workflow
EXPLORATION -> EXPERIMENTATION WITH DIFFERENT METHODS -> REFINING PROMISING METHODS
The Model Training Workflow
DATA -> FEATURE EXTRACTION -> FEATURE VECTORS -> MODEL TRAINING -> MODEL -> USE MODEL -> OUTPUT
Data volumes in training
(Chart: data volume at each stage, DATA -> feature extraction -> FEATURE VECTORS -> model training -> MODEL)
• DATA: I have a lot of data
• FEATURE VECTORS: Is this result “big data”?
Examples:
- 10TB of network traffic distilled into 9K IP address FVs
- 10TB of medical records distilled into 50M patient FVs
- 10TB of documents distilled into 5TB of document FVs
• MODEL: The model itself is usually pretty tiny
• Applying that model to all the
data is a big data problem!
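Applying a tiny model to all the data fits the map-only pattern well: ship the model to every mapper and score each record independently. A Python sketch under stated assumptions: the logistic model, its weights, and the feature names are invented for the example and do not come from the talk.

```python
import math

# A "tiny" model: a handful of weights, small enough to ship to every mapper.
# The weights and feature names are hypothetical.
MODEL = {"bias": -1.0, "weights": {"bytes_out": 0.002, "failed_logins": 0.8}}

def score(features, model=MODEL):
    """Logistic score in (0, 1) for one feature vector."""
    z = model["bias"] + sum(w * features.get(name, 0.0)
                            for name, w in model["weights"].items())
    return 1.0 / (1.0 + math.exp(-z))

def scoring_mapper(records):
    # Map-only job: each record is scored independently of all others.
    for key, features in records:
        yield key, score(features)

scored = dict(scoring_mapper([
    ("10.0.0.1", {"failed_logins": 5}),
    ("10.0.0.2", {}),
]))
print(scored)
```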
Some hurdles
• Where do I run non-hadoop code?
• How do I host results for the application?
• How do I use my model on streaming data?
• Automate performance measurement
Miscellaneous:
Train all the classifiers!
Training a classifier might not be a big data problem…
… but training lots of them is!
Examples:
Train a model per user to detect anomalous events
Train a Boolean model per label possibility
Ensemble methods
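The "model per user" idea maps onto a reducer naturally: group events by user, then train one small model per group. The sketch below uses a deliberately simple model (mean and standard deviation, with a z-score threshold) as a stand-in; the talk does not say which model to use, so treat every name here as hypothetical.

```python
import math
from collections import defaultdict

def train_user_model(values):
    """Fit a per-user baseline: mean and population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"mean": mean, "stdev": math.sqrt(var)}

def train_all(events):
    """events: iterable of (user, value). Trains one small model per user."""
    by_user = defaultdict(list)      # what the shuffle does for a reducer
    for user, value in events:
        by_user[user].append(value)
    return {user: train_user_model(vals) for user, vals in by_user.items()}

def is_anomalous(model, value, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    if model["stdev"] == 0:
        return value != model["mean"]
    return abs(value - model["mean"]) / model["stdev"] > threshold

models = train_all([("alice", 10), ("alice", 12), ("alice", 11),
                    ("bob", 100), ("bob", 300)])
print(is_anomalous(models["alice"], 50), is_anomalous(models["bob"], 250))
```

Each individual training call is tiny; the scale comes from the number of users.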
So what’s the point?
• Not all stages of the model training workflow
are Hadoop problems
• Use the right tool for the job in each phase
e.g., non-parallel model training in some cases
DATA -> FEATURE EXTRACTION -> FEATURE VECTORS -> MODEL TRAINING -> MODEL -> USE MODEL -> OUTPUT
Natural Language Pre-Processing
• A lot of classic tools in NLP are “embarrassingly
parallel”
– Stemming
– Lexical analysis
– Parsing
– Tokenization
– Normalization
– Removing stop words
– Spell check
Each of these applies to a segment of text and
doesn’t have much to do with any other piece of
text in the corpus.
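What makes these steps embarrassingly parallel is that the function only needs the one piece of text it is given. A rough Python sketch: the stop-word list is a tiny illustrative sample and `naive_stem` is a crude stand-in for a real stemmer, so treat both as assumptions.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "over"}  # tiny illustrative list

def naive_stem(token):
    """Very rough suffix stripping; a real stemmer does far more."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Normalize, tokenize, drop stop words, stem. Uses only `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

docs = ["The quick brown fox jumps over the lazy dog",
        "Parsing the logs"]
processed = [preprocess(d) for d in docs]  # each call is independent
print(processed)
```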
Python, NLTK, and Pig
• Pig is a higher-level abstraction over MapReduce
• NLTK is a popular natural language toolkit for Python
• Pig allows you to stream data through arbitrary
processes (including python scripts)
• You can use UDFs to wrap NLTK methods, but the need
to use Jython sucks
• Use Pig to move your data around, use a real package
to do the work on the records
postdata = STREAM data THROUGH `my_nltk_script.py`;
(I do the same thing with SciPy and NumPy)
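A script used with STREAM just reads records from standard input and writes records to standard output. Below is a guess at the general shape such a script could take, not the author's `my_nltk_script.py`: the regex tokenizer stands in for an NLTK call, and the one-record-per-line, tab-delimited output format is an assumption.

```python
import io
import re

def process_stream(infile, outfile):
    """Read one record per line, write `original<TAB>tokens` per line."""
    for line in infile:
        text = line.rstrip("\n")
        tokens = re.findall(r"\w+", text.lower())  # stand-in for an NLTK tokenizer
        outfile.write(text + "\t" + " ".join(tokens) + "\n")

# In a real streaming script this would be:
#     process_stream(sys.stdin, sys.stdout)
# Here an in-memory stream stands in for the data Pig would pipe through.
fake_input = io.StringIO("Hello, World!\nPig streams records.\n")
fake_output = io.StringIO()
process_stream(fake_input, fake_output)
print(fake_output.getvalue(), end="")
```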
OpenNLP and MapReduce
• OpenNLP is an Apache project that provides an NLP library
• “It supports the most common NLP tasks, such as
tokenization, sentence segmentation, part-of-
speech tagging, named entity extraction,
chunking, parsing, and coreference resolution.”
• Written in Java with reasonable APIs
• MapReduce is just Java, so you can link into just
about anything you want
• Use OpenNLP in the Mapper to enrich, normalize,
cleanse your data
One of my favorites: TF-IDF
• TF-IDF (Term Frequency, Inverse Document
Frequency)
– TF: how common is the word in the document
– IDF: how common is this word everywhere
(inverse)
– Multiply both and get a score for each term
• Easily pulls out topics in documents (or lack of
topics)
• Parallelizable (examples online)
Example: The quick brown fox jumps over the lazy dog
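Here is a small single-machine sketch of the scoring in Python. It assumes one common variant of the formula (term count divided by document length, times the natural log of documents over document frequency); other weightings exist, and the three-document corpus is made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per doc."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))     # count each term once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the lazy dog sleeps".split(),
    "the fox runs".split(),
]
scores = tf_idf(corpus)
print(max(scores[1], key=scores[1].get))  # sleeps
```

A word like "the" appears in every document, so its IDF is log(1) = 0 and it drops out no matter how often it occurs.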
Somewhat related: Text extraction
• Extracting text with OCR or Speech-to-text (for
example) can be an expensive operation
• Use Hadoop’s parallelism to apply your
method against a large corpus of data
• You can’t really make individual extraction
faster, but you can make the overall process
faster
So what’s the point?
• Hadoop can be used to glue together already
existing libraries
– You just have to figure out how to split the
problem up yourself
• Utilize a lot of the NLP toolkits to process text
Recommender Systems
• Hadoop is good at recommender systems
– Recommender systems like a lot of data
– Systems want to make a lot of recommendations
• A number of methods available in Mahout
• I’ll be talking about Collaborative Filtering
1. Find similar users
2. Make recommendations based on those
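The two steps can be sketched in plain Python. This is an illustrative user-based variant using Jaccard similarity over sets of liked items; it is not Mahout's implementation, and the users, items, and function names are made up.

```python
def jaccard(a, b):
    """Similarity of two item sets: overlap divided by union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, likes, k=2):
    """likes: {user: set of items}. Returns items ranked for `target`."""
    mine = likes[target]
    # Step 1: find the k most similar users.
    neighbors = sorted(
        ((jaccard(mine, items), user)
         for user, items in likes.items() if user != target),
        reverse=True,
    )[:k]
    # Step 2: score unseen items by the similarity of the users who like them.
    scores = {}
    for sim, user in neighbors:
        if sim == 0:
            continue                     # nothing in common, no signal
        for item in likes[user] - mine:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))

likes = {
    "alice": {"pig", "hive", "hbase"},
    "bob": {"pig", "hive", "mahout"},
    "carol": {"pig", "storm"},
    "dave": {"excel"},
}
print(recommend("alice", likes))  # ['mahout', 'storm']
```

Note that the code never looks at what an item is, only at who is related to it.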
I have no idea what I’m doing
• Collaborative Filtering is cool because it
doesn’t have to understand the user or the
item… just the relationships
• Relationships are easy to extract, features and
labels not so much
• Features can be folded into the similarity
metrics
What’s the point?
• Recommender systems parallelize and there is
a Hadoop library for it
• They use relationships, not features, so the
data is easier to extract
• If you can fit your problem into the
recommendation framework, you can do
something interesting
Other stuff: Graphs
• Graphs are useful and a lot can be done with
Hadoop
• Check out Giraph
• Check out how Accumulo has been used to
store graphs (google: “Graph 500 Accumulo”)
• Stuff to do:
– Subgraph extraction
– Missing edge recommendation
– Cool visualizations
– Summarizing relationships
Other stuff: Clustering
• Provides interesting insight into groups
• Some methods parallelize well
• Mahout has:
– Dirichlet process clustering
– K-means
– Fuzzy K-means
Other stuff: R and Hadoop
• RHIPE and RHadoop allow you to write
MapReduce jobs in R, instead of Java
• Can also use Hadoop streaming to use R
• This doesn’t magically parallelize all your R
code
• Useful for integrating Hadoop with R more seamlessly
Wrap up
• Hadoop is good at certain things
• Hadoop can’t do everything and you have to
do the rest
THANKS!
dminer@clearedgeit.com
@donaldpminer

More Related Content

What's hot

Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Simplilearn
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Victor Sanchez Anguix
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
Krishna Sankar
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
Lars Marius Garshol
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
Sri Ambati
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
MLconf
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
Sri Ambati
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Lucidworks
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
Anqi Fu
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
Arjen de Vries
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
Impetus Technologies
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
Travis Oliphant
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
Travis Oliphant
 
MongoDB & Machine Learning
MongoDB & Machine LearningMongoDB & Machine Learning
MongoDB & Machine Learning
Tom Maiaroto
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
Jake Mannix
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
Hortonworks
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
Arjen de Vries
 
Java Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and SolutionsJava Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and Solutions
"Mikhail "Misha"" Dmitriev
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku
 
H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
H2O World - Benchmarking Open Source ML Platforms - Szilard PafkaH2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
Sri Ambati
 

What's hot (20)

Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
MongoDB & Machine Learning
MongoDB & Machine LearningMongoDB & Machine Learning
MongoDB & Machine Learning
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
Java Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and SolutionsJava Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and Solutions
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
 
H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
H2O World - Benchmarking Open Source ML Platforms - Szilard PafkaH2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
 

Viewers also liked

Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
Caserta
 
Pig on Spark
Pig on SparkPig on Spark
Pig on Spark
mortardata
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
Vivian S. Zhang
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
Vivian S. Zhang
 
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Vivian S. Zhang
 
Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)
Vivian S. Zhang
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Vivian S. Zhang
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Vivian S. Zhang
 
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Vivian S. Zhang
 
Spatial query tutorial for nyc subway income level along subway
Spatial query tutorial  for nyc subway income level along subwaySpatial query tutorial  for nyc subway income level along subway
Spatial query tutorial for nyc subway income level along subway
Vivian S. Zhang
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
Vivian S. Zhang
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Vivian S. Zhang
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Vivian S. Zhang
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
Vivian S. Zhang
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
Vivian S. Zhang
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Vivian S. Zhang
 
Xgboost
XgboostXgboost
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
Vivian S. Zhang
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
Vivian S. Zhang
 

Viewers also liked (20)

Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
 
Pig on Spark
Pig on SparkPig on Spark
Pig on Spark
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
 
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
Data Science Academy Student Demo day--Moyi Dang, Visualizing global public c...
 
Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)Natural Language Processing(SupStat Inc)
Natural Language Processing(SupStat Inc)
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 Hack session for NYTimes Dialect Map Visualization( developed by R Shiny) Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
Hack session for NYTimes Dialect Map Visualization( developed by R Shiny)
 
Spatial query tutorial for nyc subway income level along subway
Spatial query tutorial  for nyc subway income level along subwaySpatial query tutorial  for nyc subway income level along subway
Spatial query tutorial for nyc subway income level along subway
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
 
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
 
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nycData Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nycData Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
Data Science Academy Student Demo day--Divyanka Sharma, Businesses in nyc
 
Xgboost
XgboostXgboost
Xgboost
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
 

Similar to Data science and Hadoop

Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureGwen (Chen) Shapira
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
elephantscale
 
Big data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup GroupBig data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup Group
Sri Kanajan
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"
Portland R User Group
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
Portland R User Group
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Cloudera, Inc.
 
BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data
Mindgrub Technologies
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
Ike Ellis
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
Elizabeth Smith
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
responseteam
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
Russell Jurney
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
Russell Jurney
 
DataIntensiveComputing.pdf
DataIntensiveComputing.pdfDataIntensiveComputing.pdf
DataIntensiveComputing.pdf
Brahmam8
 
Hands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop EcosystemHands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop Ecosystem
Adaryl "Bob" Wakefield, MBA
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
Russell Jurney
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
Elizabeth Smith
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
Sandeep Singh
 

Data science and Hadoop

  • 1. Hadoop for Data Science Donald Miner NYC Pig User Group August 22, 2013
  • 3. I’ll talk about… Intro to Hadoop Some reasons why I think Hadoop is cool (is this cliché yet?) Step 1: Hadoop Step 2: ???? Step 3: Data Science! Some examples of data science work on Hadoop What can Hadoop do to enable data science work?
  • 4. Hadoop • Distributed platform for thousands of nodes • Data storage and computation framework • Open source • Runs on commodity hardware
  • 5. Hadoop Distributed File System (HDFS) • Stores files in folders (that’s it) – Nobody cares what’s in your files • Chunks large files into blocks (~64MB-2GB) • 3 replicas of each block (better safe than sorry) • Blocks are scattered all over the place (diagram: a FILE split into BLOCKS)
  • 6. MapReduce • Analyzes raw data in HDFS where the data is • Jobs are split into Mappers and Reducers Mappers (you code this): load data from HDFS; filter, transform, parse; output (key, value) pairs Reducers (you code this, too): input is automatically grouped by the mapper’s output key; aggregate, count, statistics; output to HDFS
  • 7. Hadoop Ecosystem • Higher-level languages like Pig and Hive • HDFS-based data systems like HBase and Accumulo • Close friends like ZooKeeper, Flume, Storm, Cassandra, Avro
  • 8. Pig • Pig is a fantastic query language that runs MapReduce jobs • Higher-level than MapReduce: write code in terms of GROUP BY, DISTINCT, FOREACH, FILTER, etc. • Custom loaders and storage functions make this good glue • I use this a lot A = LOAD 'data.txt' AS (name:chararray, age:int, state:chararray); B = GROUP A BY state; C = FOREACH B GENERATE group, COUNT(A), AVG(A.age); DUMP C;
  • 9. Mahout • Mahout is a Machine Learning Library • Has both parallel and non-parallel implementations of a number of algorithms: – Recommenders – Clustering – Classification
  • 10. Cool Thing #1: Linear Scalability • HDFS and MapReduce scale linearly • If you have twice as many computers, jobs run twice as fast • If you have twice as much data, jobs take twice as long • If you have twice as many computers, you can store twice as much data DATA LOCALITY!!
  • 11. Cool Thing #2: Schema on Read LOAD DATA FIRST, ASK QUESTIONS LATER Data is parsed/interpreted as it is loaded out of HDFS What implications does this have? BEFORE: ETL, schema design upfront, tossing out original data, comprehensive data study WITH HADOOP: Keep original data around! Have multiple views of the same data! Work with unstructured data sooner! Store first, figure out what to do with it later!
  • 12. Cool Thing #3: Transparent Parallelism Network programming? Inter-process communication? Threading? Distributed stuff? Fault tolerance? Code deployment? RPC? Message passing? Locking? Data storage? Scalability? Data center fires? With MapReduce, I DON’T CARE … I just have to fit my solution into this tiny box (diagram: “Your solution” as a small box inside the “MapReduce Framework”)
  • 13. Cool Thing #4: Unstructured Data • Unstructured data: media, text, forms, log data, lumped structured data • Query languages like SQL and Pig assume some sort of “structure” • MapReduce is just Java: You can do anything Java can do in a Mapper or Reducer One of the things Hadoop can do for you is turn your unstructured data into structured data
  • 14. The rest of the talk • Four threads: – Data exploration – Classification – NLP – Recommender systems I’m using these to illustrate some points
  • 15. Exploration • Hadoop is great at exploring data! • I like to explore data in a couple ways: – Filtering – Sampling – Summarization – Evaluate cleanliness • I like to spend 50% of my time doing exploration (but unfortunately it’s the first thing to get cut)
  • 16. Filtering • Filtering is like a microscope: I want to take a closer look at a subset • In MapReduce, you do this in the mapper • Identify nasty records you want to get rid of • Examples: – Only New York data – Only millennials – Remove gibberish – Only 5 minutes
  • 17. Sampling • Hadoop isn’t the king of interactive analysis • Sampling is a good way to grab a set of data then work with it locally (Excel?) • Pig has a handy SAMPLE keyword • Types of sampling: – Sample randomly across the entire data set – Sub-graph extraction – Filters (from the last slide)
  • 18. Summarization • Summarization is a bird’s-eye view • MapReduce is good at summarization: – Mappers extract the group-by keys – Reducers do the aggregation • I like to: – Count number, get stdev, get average, get min/max of records in several groups – Count nulls in columns (if applicable) – Grab top-10 lists
  • 19. Evaluating Cleanliness • I’ve never been burned twice: – There is a list of things that I like to check • Things to check for: – Fields that shouldn’t be null that are – Duplicates (does unique records = records?) – Dates (look for 1970; look at formats; time zones) – Things that should be normalized – Keys that are different because of trash, e.g. “ abc “ != “abc”
  • 20. What’s the point? • Hadoop is really good at this stuff! • You probably have a lot of data and a lot of it is garbage! • Take the time to do this and your further work will be much easier • It’s hard to tell what methods you should use until you explore your data
  • 21. Classification • Classification is taking feature vectors (derived from your data), and then guessing some sort of label – E.g., sunny, Saturday, summer -> play tennis; rainy, Wednesday, winter -> don’t play tennis • Most classification algorithms aren’t easily parallelizable or lack good parallel implementations • You need a training set of true feature vectors and labels… how often is your data labeled? • I’ve found classification rather hard, except for when…
  • 22. Overall Classification Workflow: EXPLORATION → EXPERIMENTATION OF DIFFERENT METHODS → REFINING PROMISING METHODS The Model Training Workflow: DATA → FEATURE EXTRACTION → FEATURE VECTORS → MODEL TRAINING → MODEL → USE MODEL → OUTPUT
  • 23. Data volumes in training (chart of DATA VOLUME at each stage) DATA: I have a lot of data
  • 24. Data volumes in training DATA → feature extraction → FEATURE VECTORS Is this result “big data”? Examples: - 10TB of network traffic distilled into 9K IP address FVs - 10TB of medical records distilled into 50M patient FVs - 10TB of documents distilled into 5TB of document FVs
  • 25. Data volumes in training DATA → feature extraction → FEATURE VECTORS → model training → MODEL The model itself is usually pretty tiny
  • 26. Data volumes in training DATA → feature extraction → FEATURE VECTORS → model training → MODEL Applying that model to all the data is a big data problem!
  • 27. Some hurdles • Where do I run non-Hadoop code? • How do I host our results to the application? • How do I use my model on streaming data? • How do I automate performance measurement?
  • 28. Miscellaneous: Train all the classifiers! Training a classifier might not be a big data problem… … but training lots of them is! Examples: Train a model per user to detect anomalous events Train a Boolean model per label possibility Ensemble methods
  • 29. So what’s the point? • Not all stages of the model training workflow are Hadoop problems • Use the right tool for the job in each phase, e.g., non-parallel model training in some cases (workflow: DATA → FEATURE EXTRACTION → FEATURE VECTORS → MODEL TRAINING → MODEL → USE MODEL → OUTPUT)
  • 30. Natural Language Pre-Processing • A lot of classic tools in NLP are “embarrassingly parallel” – Stemming – Lexical analysis – Parsing – Tokenization – Normalization – Removing stop words – Spell check Each of these applies to a segment of text and doesn’t have much to do with any other piece of text in the corpus.
  • 31. Python, NLTK, and Pig • Pig is a higher-level abstraction over MapReduce • NLTK is a popular natural language toolkit for Python • Pig allows you to stream data through arbitrary processes (including Python scripts) • You can use UDFs to wrap NLTK methods, but the need to use Jython sucks • Use Pig to move your data around, use a real package to do the work on the records postdata = STREAM data THROUGH `my_nltk_script.py`; (I do the same thing with SciPy and NumPy)
  • 32. OpenNLP and MapReduce • OpenNLP is an Apache project that provides an NLP library • “It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.” • Written in Java with reasonable APIs • MapReduce is just Java, so you can link into just about anything you want • Use OpenNLP in the Mapper to enrich, normalize, cleanse your data
  • 33. One of my favorites: TF-IDF • TF-IDF (Term Frequency, Inverse Document Frequency) – TF: how common is the word in the document – IDF: how common is this word everywhere (inverse) – Multiply both and get a score for each term • Easily pulls out topics in documents (or lack of topics) • Parallelizable (examples online) Example: The quick brown fox jumps over the lazy dog
  • 34. Somewhat related: Text extraction • Extracting text with OCR or Speech-to-text (for example) can be an expensive operation • Use Hadoop’s parallelism to apply your method against a large corpus of data • You can’t really make individual extraction faster, but you can make the overall process faster
  • 35. So what’s the point? • Hadoop can be used to glue together already existing libraries – You just have to figure out how to split the problem up yourself • Utilize a lot of the NLP toolkits to process text
  • 36. Recommender Systems • Hadoop is good at recommender systems – Recommender systems like a lot of data – Systems want to make a lot of recommendations • A number of methods available in Mahout • I’ll be talking about Collaborative Filtering 1. Find similar users 2. Make recommendations based on those
  • 37. I have no idea what I’m doing • Collaborative Filtering is cool because it doesn’t have to understand the user or the item… just the relationships • Relationships are easy to extract, features and labels not so much • Features can be folded into the similarity metrics
  • 38. What’s the point? • Recommender systems parallelize and there is a Hadoop library for it • They use relationships, not features, so the data is easier to extract • If you can fit your problem into the recommendation framework, you can do something interesting
  • 39. Other stuff: Graphs • Graphs are useful and a lot can be done with Hadoop • Check out Giraph • Check out how Accumulo has been used to store graphs (google: “Graph 500 Accumulo”) • Stuff to do: – Subgraph extraction – Missing edge recommendation – Cool visualizations – Summarizing relationships
  • 40. Other stuff: Clustering • Provides interesting insight into groups • Some methods parallelize well • Mahout has: – Dirichlet process clustering – K-means – Fuzzy K-means
  • 41. Other stuff: R and Hadoop • RHIPE and RHadoop allow you to write MapReduce jobs in R, instead of Java • Can also use Hadoop streaming to use R • This doesn’t magically parallelize all your R code • Useful to integrate into R more seamlessly
  • 42. Wrap up • Hadoop is good at certain things • Hadoop can’t do everything and you have to do the rest
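
Slides 6 and 18 describe the mapper/reducer split for summarization: mappers parse records and emit the group-by key, the framework groups values by key, and reducers aggregate. The following is a minimal single-process sketch of that flow in plain Python, not Hadoop code. The function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the comma-separated `name,age,state` record layout are assumptions made for illustration, loosely mirroring the Pig example on slide 8.

```python
from collections import defaultdict
from statistics import mean


def mapper(line):
    """Parse one raw record and emit (state, age); malformed lines are filtered out."""
    parts = line.strip().split(",")
    if len(parts) != 3:
        return  # wrong number of fields: drop the record
    _name, age, state = parts
    try:
        yield state.strip(), int(age)
    except ValueError:
        return  # age is not an integer: drop the record


def shuffle(pairs):
    """Stand-in for the framework step that groups values by the mapper's output key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Aggregate all values that share one key."""
    return key, {
        "count": len(values),
        "avg": mean(values),
        "min": min(values),
        "max": max(values),
    }


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of raw lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


records = ["ann,34,NY", "bob,28,NY", "cat,41,MD", "garbage line"]
summary = run_job(records)
```

Here `summary["NY"]` holds a count of 2 and an average of 31, and the malformed line is dropped in the mapper, which is also where slide 16 suggests doing filtering.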
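
Slide 31 streams records from Pig through an external Python script. As a rough sketch of what such a script could look like, the code below cleans the last field of each tab-separated record. It assumes Pig's default streaming format of one tab-delimited tuple per line; the stop-word list and the helper names `clean` and `process_line` are hypothetical, and plain string methods stand in for NLTK so the example has no dependencies.

```python
# Tiny illustrative stop-word list (an assumption; a real job would use a fuller one).
STOP_WORDS = {"the", "a", "an", "and", "of", "over"}


def clean(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    tokens = (tok.strip(".,!?;:\"'()").lower() for tok in text.split())
    return " ".join(tok for tok in tokens if tok and tok not in STOP_WORDS)


def process_line(line):
    """Clean the last tab-separated field of a record; other fields pass through."""
    fields = line.rstrip("\n").split("\t")
    fields[-1] = clean(fields[-1])
    return "\t".join(fields)
```

To use this as a streaming script, one option is to loop over `sys.stdin` and print `process_line(line)` for each line, so that every input record produces exactly one output record.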
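
Slide 33 summarizes TF-IDF as term frequency multiplied by inverse document frequency. The sketch below shows one common variant, assuming TF is the term count divided by document length and IDF is log(N / document frequency); other weightings exist and the slide does not say which one the talk used. The function name `tf_idf` and the two extra sample documents are made up for illustration; the first document is the slide's own example sentence.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document, using whitespace tokenization."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps",
    "the fox runs",
]
scores = tf_idf(docs)
```

With these documents, "the" appears everywhere and scores 0.0, while "quick" appears in only one document and outranks "fox", which appears in two. On Hadoop, the per-document counts and the document frequencies would each be a separate grouping step.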
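
Slide 36 reduces collaborative filtering to two steps: find similar users, then recommend from what they liked. Below is a toy in-memory sketch of those steps that uses only user-item relationships, as slide 37 emphasizes. Jaccard similarity, the neighborhood size `k`, and the sample `likes` data are assumptions chosen for illustration; this is not how Mahout implements its recommenders.

```python
def jaccard(a, b):
    """Similarity of two item sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes, k=2):
    """Rank items the user has not seen by the summed similarity of the k nearest users."""
    mine = likes[user]

    # Step 1: find the k most similar users.
    neighbors = sorted(
        ((jaccard(mine, items), other) for other, items in likes.items() if other != user),
        reverse=True,
    )[:k]

    # Step 2: score unseen items by the similarity of the neighbors who like them.
    scores = {}
    for similarity, other in neighbors:
        if similarity == 0:
            continue  # a user with no overlap carries no signal
        for item in likes[other] - mine:
            scores[item] = scores.get(item, 0.0) + similarity

    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))


likes = {
    "alice": {"pig", "hive", "hbase"},
    "bob": {"pig", "hive", "mahout"},
    "carol": {"pig", "giraph"},
    "dave": {"storm"},
}
recommendations = recommend("alice", likes)
```

For "alice", the nearest neighbors are "bob" (similarity 0.5) and "carol" (0.25), so the result is `["mahout", "giraph"]`. The pairwise similarity step is the expensive part at scale, which is where a parallel implementation helps.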

Editor's Notes

  1. Donald's talk will cover how to use native MapReduce in conjunction with Pig, including a detailed discussion of when users might be best served to use one or the other.