Introduction to Big Data
Sri Kanajan
Big data Intro - Presentation to OCHackerz Meetup Group

Introduction to Big Data
Hadoop ,MapReduce – Storage, Processing
Machine Learning – Analytics
Visualization

Published in: Education, Technology

  1. Introduction to Big Data
     Sri Kanajan
  2. Big Data
     • When data has too much volume, variety, or velocity (the three Vs) to manage with a traditional RDBMS, you have entered Big Data!
     • Data Storage and Manipulation, at Scale
       – MapReduce, Hadoop, relationship to databases (framework)
       – Key-value stores and NoSQL; tradeoffs of SQL and NoSQL (database type)
       – Entity resolution, record linkage, data cleaning (data integration)
     • Analytics (Machine Learning)
       – Basic statistical modeling, experiment design, overfitting
       – Supervised learning: overview, simple nearest neighbor, decision trees/forests, regression
       – Unsupervised learning: k-means, multi-dimensional scaling
       – Graph analytics: PageRank, community detection, recursive queries, iterative processing
       – Text analytics: latent semantic analysis
       – Collaborative filtering: slope-one
     • Communicating Results
       – Visualization, data products, visual data analytics
  3. Outline
     • What is Big Data?
     • Why is this important now?
     • Key Concepts
       – Hadoop, MapReduce: storage, processing
       – Machine Learning: analytics
       – Visualization
  4. Big Data Everywhere!
     • Lots of data is being collected and warehoused
       – Web data, e-commerce
       – Purchases at department/grocery stores
       – Bank/credit card transactions
       – Social networks
     • Unknown hidden relationships within this data!
  5. How much data?
     • Google processes 20 PB a day (2008)
     • Wayback Machine has 3 PB + 100 TB/month (3/2009)
     • Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
     • eBay has 6.5 PB of user data + 50 TB/day (5/2009)
     • CERN's Large Hadron Collider (LHC) generates 15 PB a year
     "640K ought to be enough for anybody."
  6. Type of Data
     • Relational data (tables/transactions/legacy data)
     • Unstructured text data
       – Log data, comments, user-generated text
     • Semi-structured data (XML)
     • Graph data
       – Social networks, Semantic Web (RDF)
     • Real-time data
       – You can only scan the data once and need to do analytics quickly
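The "scan the data once" constraint on real-time data can be illustrated with a single-pass summary. The following is a minimal sketch, assuming numeric readings arriving from any Python iterable; the names `RunningStats` and `summarize` are hypothetical and not from the slides. It uses Welford's method to keep a running count, mean, and variance without storing the stream.

```python
class RunningStats:
    """Running count, mean, and variance computed in a single pass (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; reported as 0.0 until at least one value is seen.
        return self._m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume an iterable once, never keeping earlier values in memory."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats
```

Because only three numbers are kept per stream, memory use stays constant no matter how many readings arrive.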
  7. What does Big Data Give You?
     • Without Big Data
       – Many separate data warehouses on non-distributed architectures
       – Merging databases required modifying data structures and writing custom code
       – Scaling database size is a continual problem
       – Any large-scale analytics took days or weeks, plus a large coordination effort within IT to get database access
       – Data analysis is a large effort, and much data remains unanalyzed or, even worse, is never stored
     • With Big Data
       – Hadoop provides a single view of all databases that can be distributed
       – Database size is a non-issue
       – Ability to perform advanced statistical analysis on very large datasets very quickly
       – Data analysis is the competitive edge for many companies, since barriers to entry keep dropping as platforms are developed
  8. Examples
     • Norwegian Food Safety Authority
       – Accumulates data on all farm animals
       – Birth, death, movements, medication, samples, ...
     • Hafslund
       – Time series from hydroelectric dams, power prices, meters of individual customers, ...
     • Social Security Administration
       – Data on individual cases, actions taken, outcomes, ...
     • Statoil
       – Massive amounts of data from oil exploration, operations, logistics, engineering, ...
     • Retailers
       – See Target example above
       – Also, connection between what people buy, weather forecast, logistics, ...
  9. Big Data
  10. Power of Distribution
      45 Minutes! 4.5 Minutes!
  11. Outline
      • What is Big Data?
      • Why is this important now?
      • Key Concepts
        – Hadoop, MapReduce: storage, processing
        – Machine Learning: analytics
        – Visualization
  12. Hadoop
      • A framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model (i.e., MapReduce)
        – Distributed data processing
        – Works with structured and unstructured data
        – Open source
        – Master-slave architecture
        – Fault tolerant using commodity hardware
  13. MapReduce
      • Programming model on top of Hadoop
      • Basic concept is to provide a programming model that immediately supports parallel processing (SQL, on the other hand, does not natively encourage parallel processing)
      • Pig is a framework and programming language for developing MapReduce programs
      • Note
        – MapReduce is great for extremely large data sets with simple relations. SQL is great for medium-size data sets with complex relationships
        – I.e., you have to choose the right technology for your problem space
  14. A Simple Example
      • Counting words in a large set of documents

      map(string key, string value)
        // key: document name
        // value: document contents
        for each word w in value
          EmitIntermediate(w, "1");

      reduce(string key, iterator values)
        // key: word
        // values: list of counts
        int result = 0;
        for each v in values
          result += ParseInt(v);
        Emit(AsString(result));
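The word-count pseudocode can be tried out without a cluster. Below is a minimal single-process sketch in Python, not Hadoop's actual API: the function names `map_words`, `reduce_counts`, and `run_word_count` are hypothetical, and the "shuffle" step that a real framework performs across machines is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in contents.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_word_count(documents):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, count in map_words(doc_name, contents):
            grouped[word].append(count)  # shuffle: collect values per key
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


docs = {"a.txt": "big data is big", "b.txt": "data is everywhere"}
print(run_word_count(docs))
```

Each call to `map_words` depends only on its own document, and each call to `reduce_counts` only on its own word, which is what lets a framework spread both steps over many machines.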
  15. MapReduce
  16. Outline
      • What is Big Data?
      • Why is this important now?
      • Key Concepts
        – Hadoop, MapReduce: storage architecture
        – Machine Learning: analytics
        – Visualization
  17. Machine Learning
      • Essentially ways to analyze data to extract valuable information, with or without training data
        – Prediction: predicting a variable from data
        – Classification: assigning records to predefined groups
        – Clustering: splitting records into groups based on similarity
        – Association learning: seeing what often appears together with what
        – And many others…
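Association learning, "seeing what often appears together with what", can be illustrated in its simplest form by counting co-occurring pairs. This is a rough sketch under the assumption that the data is a list of shopping baskets; `pair_counts` is a hypothetical helper, and real association-rule miners add support and confidence thresholds on top of counts like these.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


baskets = [["bread", "milk"], ["bread", "milk", "eggs"], ["milk", "eggs"]]
print(pair_counts(baskets).most_common())
```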
  18. Now you have an optimization metric by which you can automate the exploration of all possible hypotheses!
      Problems with this approach?
  19. Two kinds of learning
      • Supervised
        – We have training data with correct answers
        – Use training data to prepare the algorithm
        – Then apply it to data without a correct answer
      • Unsupervised
        – No training data
        – Throw data into the algorithm, hope it makes some kind of sense out of the data
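The contrast between the two kinds of learning can be sketched with two techniques the deck names in its overview: simple nearest neighbor (supervised) and k-means (unsupervised). The code below is an illustrative sketch, not a production implementation; the function names, the tuple-based point format, and the fixed iteration count are all assumptions. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math


def nearest_neighbor_predict(training, point):
    """Supervised: return the label of the closest labeled training point.

    training: list of (point, label) pairs, i.e. data with correct answers.
    """
    _, label = min(training, key=lambda pair: math.dist(pair[0], point))
    return label


def k_means(points, centers, iterations=10):
    """Unsupervised: group unlabeled points around k centers (Lloyd's algorithm)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters
```

The supervised function needs labels up front, while `k_means` receives only raw points and starting centers and has to discover the grouping on its own.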
  20. Example: Collaborative Filtering
      • Goal: predict what movies/books/… a person may be interested in, on the basis of
        – Past preferences of the person
        – Other people with similar past preferences
        – The preferences of such people for a new movie/book/…
      • One approach based on repeated clustering
        – Cluster people on the basis of preferences for movies
        – Then cluster movies on the basis of being liked by the same clusters of people
        – Again cluster people based on their preferences for (the newly created clusters of) movies
        – Repeat above till equilibrium
      • Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
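As a concrete illustration of collaborative filtering, here is a small sketch of weighted Slope One, the technique listed in the deck's overview. Note that this is a different method from the repeated-clustering approach described on this slide. The function name `slope_one_predict` and the nested-dictionary rating format are assumptions made for the example.

```python
from collections import defaultdict


def slope_one_predict(ratings, user, target):
    """Predict `user`'s rating for `target` with weighted Slope One.

    ratings: {user_name: {item_name: rating}}
    Returns None when nobody has rated `target` together with one of the user's items.
    """
    diff_sum = defaultdict(float)  # item -> sum of (target rating - item rating)
    freq = defaultdict(int)        # item -> number of users who rated both
    for other_ratings in ratings.values():
        if target not in other_ratings:
            continue
        for item, rating in other_ratings.items():
            if item != target:
                diff_sum[item] += other_ratings[target] - rating
                freq[item] += 1

    numerator = 0.0
    denominator = 0
    for item, rating in ratings[user].items():
        if item != target and freq[item] > 0:
            avg_diff = diff_sum[item] / freq[item]
            # Weight each item's estimate by how many users support it.
            numerator += (rating + avg_diff) * freq[item]
            denominator += freq[item]
    return numerator / denominator if denominator else None
```

For example, with `{"john": {"A": 5, "B": 3, "C": 2}, "mark": {"A": 3, "B": 4}, "lucy": {"B": 2, "C": 5}}`, calling `slope_one_predict(ratings, "lucy", "A")` combines the average A-minus-B and A-minus-C differences seen in other users' ratings with Lucy's own ratings of B and C.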
  21. Outline
      • What is Big Data?
      • Why is this important now?
      • Key Concepts
        – Hadoop, MapReduce: storage architecture
        – Machine Learning: analytics
        – Visualization
  22. Is this an effective visual representation?
  23. Better Mapping? Why?
  24. Diagrams Showing O-Ring Damage That Were Used to Decide to Launch Challenger in 1986
  25. Representation of the Same Data
  26. Strategies to Increase the Information Encoded by Spatial Position
      • Composition
        – Orthogonal placement of axes
        – Creates a 2D metric space
  27. Strategies to Increase the Information Encoded by Spatial Position
      • Alignment
  28. Folding
      • Continuation of the axes
  29. Recursion
  30. Overloading
  31. Conclusion
      • Big Data is a huge field that combines expertise from different domains in order to find interesting information in data
      • Extracting interesting information from data is the next competitive edge for many companies as information becomes available instantly, anywhere
