Hadoop for Data Science
Donald Miner
Data Science MD
October 9, 2013
About Don
@donaldpminer
dminer@clearedgeit.com
I’ll talk about…
Intro to Hadoop (HDFS and MapReduce)
Some reasons why I think Hadoop is cool
(is this cliché yet?)
Step 1: Hadoop
Step 2: ????
Step 3: Data Science!
Some examples of data science work on Hadoop
What can Hadoop do to enable data science work?
Hadoop Distributed File System
HDFS
• Stores files in folders (that’s it)
– Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• Three replicas of each block (better safe than sorry)
• Blocks are scattered all over the place
MapReduce
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers
• Mappers (you code this)
– Load data from HDFS
– Filter, transform, parse
– Output (key, value) pairs
• Reducers (you code this, too)
– Input is automatically grouped by the mapper’s output key
– Aggregate, count, statistics
– Output to HDFS
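To make the Mapper/Reducer split concrete, here is a minimal word-count sketch. I’m using Hadoop Streaming with Python rather than the Java API; the file names and the word-count task are my own illustration, not from the slides.

    # wc_mapper.py -- Hadoop Streaming mapper: raw lines in, (key, value) out
    import sys

    for line in sys.stdin:
        for word in line.split():
            # emit one tab-separated (key, value) pair per word
            print("%s\t%d" % (word.lower(), 1))

    # wc_reducer.py -- the framework sorts mapper output by key, so all
    # pairs for a given word arrive at the reducer together
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

A typical invocation looks roughly like: hadoop jar hadoop-streaming.jar -input books/ -output counts/ -mapper wc_mapper.py -reducer wc_reducer.py -file wc_mapper.py -file wc_reducer.py (the paths here are placeholders).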
Hadoop Ecosystem
• Higher-level languages like Pig and Hive
• HDFS Data systems like HBase and Accumulo
• Close friends like ZooKeeper, Flume, Storm, Cassandra, Avro
Mahout
• Mahout is a machine learning library
• Has both MapReduce and non-parallel implementations of a number of algorithms:
– Recommenders
– Clustering
– Classification
Cool Thing #1: Linear Scalability
• HDFS and MapReduce scale linearly
• If you have twice as many computers, jobs run twice as fast
• If you have twice as much data, jobs run twice as slow
• If you have twice as many computers, you can store twice as much data
DATA LOCALITY!!
Cool Thing #2: Schema on Read
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS. What implications does this have?
BEFORE: ETL, schema design upfront, tossing out original data, comprehensive data study
WITH HADOOP: Keep original data around! Have multiple views of the same data! Work with unstructured data sooner! Store first, figure out what to do with it later!
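Schema on read just means the parsing lives in your job instead of in an up-front ETL step. A minimal sketch, assuming a hypothetical pipe-delimited log format (the field names are invented for illustration):

    # The raw line is stored untouched in HDFS; structure is imposed only
    # here, at read time, so tomorrow's job can parse it differently.
    import sys

    def parse(line):
        parts = line.rstrip("\n").split("|")   # timestamp|city|message
        if len(parts) != 3:
            return None                        # garbage stays in HDFS, unharmed
        return dict(zip(("ts", "city", "message"), parts))

    for line in sys.stdin:
        record = parse(line)
        if record is not None:
            print("%s\t%s" % (record["city"], record["ts"]))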
Cool Thing #3: Transparent Parallelism
Network programming? Inter-process communication? Threading? Distributed stuff? Fault tolerance? Code deployment? RPC? Message passing? Locking? Data storage? Scalability? Data center fires?
With MapReduce, I DON’T CARE. I just have to be sure my solution fits into the tiny box the MapReduce framework gives me.
Cool Thing #4: Unstructured Data
• Unstructured data: media, text, forms, log data lumped together with structured data
• Query languages like SQL and Pig assume some sort of “structure”
• MapReduce is just Java: you can do anything Java can do in a Mapper or Reducer
The rest of the talk
• Four threads:
– Data exploration
– Classification
– NLP
– Recommender systems
I’m using these to illustrate some points
Exploration
• Hadoop is great at exploring data!
• I like to explore data in a couple ways:
– Filtering
– Sampling
– Summarization
– Evaluating cleanliness
• I like to spend 50% of my time doing exploration (but unfortunately it’s the first thing to get cut)
Filtering
• Filtering is like a microscope:
I want to take a closer look at a subset
• In MapReduce, you do this in the mapper
• Identify nasty records you want to get rid of
• Examples (sketch below):
– Only Baltimore data
– Remove gibberish
– Only 5 minutes
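In MapReduce terms a filter is a map-only job: the mapper emits the records it likes and swallows the rest. A sketch, reusing the hypothetical pipe-delimited format from above:

    # filter_mapper.py -- map-only job, no reducer needed
    import sys

    for line in sys.stdin:
        parts = line.rstrip("\n").split("|")
        if len(parts) != 3:
            continue                    # "remove gibberish"
        if parts[1] == "Baltimore":     # "only Baltimore data"
            sys.stdout.write(line)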
Sampling
• Hadoop isn’t the king of interactive analysis
• Sampling is a good way to grab a set of data
then work with it locally (Excel?)
• Pig has a handy SAMPLE keyword
• Types of sampling:
– Sample randomly across the entire data set (sketch below)
– Sub-graph extraction
– Filters (from the last slide)
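A uniform random sample is the simplest case: a map-only job that keeps each record with some probability. A sketch (the 1% rate is arbitrary); Pig’s SAMPLE keyword does the same thing in one line:

    # sample_mapper.py -- keep ~1% of records, then pull the small result
    # out of HDFS and poke at it locally
    import random
    import sys

    for line in sys.stdin:
        if random.random() < 0.01:
            sys.stdout.write(line)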
Summarization
• Summarization is a bird’s-eye view
• MapReduce is good at summarization:
– Mappers extract the group-by keys
– Reducers do the aggregation
• I like to (sketch below):
– Count records and get stdev, average, min/max in several groups
– Count nulls in columns (if applicable)
– Grab top-10 lists
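A sketch of that division of labor: the mapper emits (group-by key, value) pairs and the reducer computes per-group statistics. The city grouping and message-length value are invented for illustration:

    # summarize_mapper.py -- extract the group-by key and a numeric value
    import sys

    for line in sys.stdin:
        parts = line.rstrip("\n").split("|")
        if len(parts) == 3:
            print("%s\t%d" % (parts[1], len(parts[2])))

    # summarize_reducer.py -- count/min/max/average per group
    import sys

    def flush(key, vals):
        print("%s\tcount=%d min=%g max=%g avg=%g"
              % (key, len(vals), min(vals), max(vals), sum(vals) / len(vals)))

    current, vals = None, []
    for line in sys.stdin:
        key, v = line.rstrip("\n").split("\t")
        if key != current and current is not None:
            flush(current, vals)
            vals = []
        current = key
        vals.append(float(v))
    if current is not None:
        flush(current, vals)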
Evaluating Cleanliness
• I’ve never been burned twice
• Things to check for:
– Fields that shouldn’t be null that are
– Duplicates (does unique records = records?)
– Dates (look for 1970; look at formats; time zones)
– Things that should be normalized
– Keys that are different because of trash, e.g. “ abc “ != “abc”
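Most of these checks are just more summarization jobs. A sketch of the duplicate check, where the whole record becomes the group-by key (a strip() in the mapper would also catch the “ abc “ != “abc” trash-key problem):

    # dup_mapper.py -- the record itself is the key
    import sys
    for line in sys.stdin:
        print("%s\t1" % line.rstrip("\n"))

    # dup_reducer.py -- if any key arrives more than once, unique != total
    import sys
    current, count = None, 0
    for line in sys.stdin:
        key = line.rstrip("\n").rsplit("\t", 1)[0]
        if key != current:
            if current is not None and count > 1:
                print("%s\t%d" % (current, count))   # a duplicated record
            current, count = key, 0
        count += 1
    if current is not None and count > 1:
        print("%s\t%d" % (current, count))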
What’s the point?
• Hadoop is really good at this stuff!
• You probably have a lot of data, and a lot of it is garbage!
• Take the time to do this and your further work will be much easier
• It’s hard to tell what methods you should use until you explore your data
Classification
• Classification is taking feature vectors (derived from your data), and then guessing some sort of label
– E.g., (sunny, Saturday, summer) -> play tennis; (rainy, Wednesday, winter) -> don’t play tennis
• Most classification algorithms aren’t easily parallelizable or lack good implementations
• You need a training set of true feature vectors and labels… how often is your data labeled?
• I’ve found classification rather hard, except for when…
Overall Classification Workflow
[Diagram: EXPLORATION → EXPERIMENTATION WITH DIFFERENT METHODS → REFINING PROMISING METHODS]
The Model Training Workflow:
[Diagram: DATA → feature extraction → FEATURE VECTORS → model training → MODEL → use model → OUTPUT]
Data volumes in training
[Diagram, built up over four slides: data volume shrinking at each stage of the workflow]
• You start with a lot of DATA
• Feature extraction distills it into FEATURE VECTORS. Is this result “big data”? Examples:
– 10TB of network traffic distilled into 9K IP address FVs
– 10TB of medical records distilled into 50M patient FVs
– 10TB of documents distilled into 5TB of document FVs
• The model itself is usually pretty tiny
• Applying that model to all the data is a big data problem! (scoring sketch below)
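The tiny-model-plus-big-data combination fits Hadoop nicely: ship the model to every mapper and stream all the data through it. A sketch, assuming a pickled model object with a predict() method and CSV feature vectors (both assumptions, not from the slides):

    # score_mapper.py -- load the small model once, score every record
    import pickle
    import sys

    # model.pkl is shipped alongside the job (e.g. with Streaming's -file);
    # assumed to expose predict() on a list of feature vectors
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    for line in sys.stdin:
        fv = [float(x) for x in line.rstrip("\n").split(",")]
        print("%s\t%s" % (line.rstrip("\n"), model.predict([fv])[0]))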
Some hurdles
• Where do I run non-Hadoop code?
• How do I serve results out to the application?
• How do I use my model on streaming data?
• How do I automate performance measurement?
So what’s the point?
• Not all stages of the model training workflow
are Hadoop problems
• Use the right tool for the job in each phase, e.g., non-parallel model training in some cases (sketch below)
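For instance, once feature extraction has distilled things down to vectors that fit in memory, training can be ordinary single-machine code. A toy sketch in NumPy; the perceptron is just a stand-in for whatever non-parallel method you settle on, and the file layout is assumed:

    # train_local.py -- non-parallel model training on distilled FVs
    import numpy as np

    def train_perceptron(X, y, epochs=10):
        """X: (n_samples, n_features); y: labels in {-1, +1}."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (xi.dot(w) + b) <= 0:   # misclassified: nudge weights
                    w, b = w + yi * xi, b + yi
        return w, b

    # feature vectors pulled out of HDFS; label assumed in the last column
    data = np.loadtxt("feature_vectors.csv", delimiter=",")
    w, b = train_perceptron(data[:, :-1], data[:, -1])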
Natural Language Processing
• A lot of classic tools in NLP are “embarrassingly parallel” over an entire corpus since words split nicely:
– Stemming
– Lexical analysis
– Parsing
– Tokenization
– Normalization
– Removing stop words
– Spell check
• Each of these applies to segments of text and doesn’t have much to do with any other piece of text in the corpus
Python, NLTK, and Pig
• Pig is a higher-level abstraction over MapReduce
• NLTK is a popular natural language toolkit for Python
• Pig allows you to stream data through arbitrary processes (including Python scripts)
• You can use UDFs to wrap NLTK methods, but the need to use Jython sucks
• Use Pig to move your data around, use a real package to do the work on the records:
    postdata = STREAM data THROUGH `my_nltk_script.py`;
(I do the same thing with SciPy and NumPy)
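By default STREAM hands each tuple to the script as a tab-delimited line on stdin and reads tab-delimited tuples back on stdout, so my_nltk_script.py can be a plain Unix filter. A sketch, assuming a single text column and tokenization as the work:

    #!/usr/bin/env python
    # my_nltk_script.py -- called by Pig's STREAM operator
    import sys
    import nltk

    for line in sys.stdin:
        text = line.rstrip("\n")              # assume one text field per tuple
        tokens = nltk.word_tokenize(text)     # the actual NLP work
        print("\t".join(tokens))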
OpenNLP and MapReduce
• OpenNLP is an Apache NLP library
• “It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.”
• Written in Java with reasonable APIs
• MapReduce is just Java, so you can link into just about anything you want
• Use OpenNLP in the Mapper to enrich, normalize, cleanse your data
So what’s the point?
• Hadoop can be used to glue together already
existing libraries
– You just have to figure out how to split the problem up yourself
Recommender Systems
• Hadoop is good at recommender systems
– Recommender systems like a lot of data
– Systems want to make a lot of recommendations
• A number of methods are available in Mahout
Collaborative Filtering:
Base recommendations on others
• Collaborative filtering is cool because it doesn’t have to understand the user or the item… just the relationships
• Relationships are easy to extract; features and labels, not so much
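A sketch of how little the algorithm needs: item-to-item co-occurrence, one relationship-only flavor of collaborative filtering (Mahout has real implementations; the one-line-per-user input format here is assumed):

    # cooccur_mapper.py -- input: "user item1 item2 ..." per line
    import itertools
    import sys

    for line in sys.stdin:
        items = sorted(set(line.split()[1:]))
        for a, b in itertools.combinations(items, 2):
            print("%s,%s\t1" % (a, b))      # no features, just relationships

    # A word-count-style reducer sums the pairs; the largest counts become
    # "people who liked A also liked B" recommendations.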
What’s the point?
• Recommender systems parallelize and there is
a Hadoop library for it
• They use relationships, not features, so the input data is easier to extract
• If you can fit your problem into the recommendation framework, you can do something interesting
Other stuff: Graphs
• Graphs are useful and a lot can be done with
Hadoop
• Check out Giraph
• Check out how Accumulo has been used to store graphs (google: “Graph 500 Accumulo”)
• Stuff to do:
– Subgraph extraction
– Missing edge recommendation
– Cool visualizations
– Summarizing relationships
Other stuff: Clustering
• Provides interesting insight into groups
• Some methods parallelize well (k-means sketch below)
• Mahout has:
– Dirichlet process clustering
– K-means
– Fuzzy K-means
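K-means shows why: each iteration is one MapReduce job in which mappers assign points to the nearest centroid and the reducer averages each cluster into the next centroid. A single-iteration sketch (the CSV point format and the centroids side file are assumptions):

    # kmeans_mapper.py -- assign each point to its nearest centroid
    import sys

    # current centroids shipped to every node as a small side file
    with open("centroids.txt") as f:
        centroids = [[float(x) for x in line.split(",")] for line in f]

    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    for line in sys.stdin:
        point = [float(x) for x in line.rstrip("\n").split(",")]
        nearest = min(range(len(centroids)),
                      key=lambda i: dist2(point, centroids[i]))
        print("%d\t%s" % (nearest, line.rstrip("\n")))

    # The reducer averages the points in each cluster to produce the next
    # round of centroids; repeat until they stop moving.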
Other stuff: R and Hadoop
• RHIPE and Rhadoop allow you to write
MapReduce jobs in R, instead of Java
• Can also use Hadoop Streaming to run R
• This doesn’t magically parallelize all your R code
• Useful to integrate into R more seamlessly
Wrap up
Hadoop can’t do everything and
you have to do the rest
THANKS!
dminer@clearedgeit.com
@donaldpminer
About this talk
This is a talk I gave at the Data Science MD meetup. It was based on the talk I gave about a month before at Data Science NYC (http://www.slideshare.net/DonaldMiner/data-scienceandhadoop). I talk about data exploration, NLP, classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.