Big data analytics 1


This presentation explains the basics of Big Data and then focuses on Big Data analytics.

Published in: Education, Technology

  1. By Gaurav Chauhan (121060753005). Guided by Prof. Rajesh Ingle, Pune Institute of Computing Technology.
  2. Understanding
     - Do we know Big Data?
     - What is Big Data?
     - Where is Big Data coming from?
     - Uses of Big Data
     Technology
     - Big Data in action
     - Big Data analytics technologies
  3. Data: collected facts. Information: meaning derived from data, that is, meaningful data. Source: any book on databases.
  4. Big Data is not new. It has just grown big enough that we started noticing it. It is the same old small chunks of data, now in large volumes. Big Data is not only about:
     - a larger volume of data
     - unmanaged data
     - social media
     Then what is it?
  5. [Diagram: Big Data pipeline]
     - Data sources: web logs, click streams, ERP, CRM, RSS feeds, social networks
     - Pre-process and process: capture, store, integrate; Hadoop cluster (map, transform, clean)
     - Analytical data storage
     - Analytics: reports, scorecards, forecasting, SQL queries, real-time systems
  6. Big Data is a new way of looking at the data we already have. It means examining the data for deeper insight rather than relying on a specific set of values. It is therefore used to derive more results from given data sets.
  7. Image source:
  8. Numerous sources:
     - Cookies, IP tracking
     - Person tracking
     - Social messages on social networking websites (e.g. Facebook, Twitter)
     - Stock market trades
     - And counting...
  9. Origins of Big Data and their uses:
     - Websites: user preferences, shopping interests
     - Social messages: public interests, opinions
     - Digital receipts: personalized purchase suggestions
     - Healthcare data: preparing for diseases, prediction
     - Telecom data: new technologies
     - Space data: invention of new space technology
  10. We have a large amount of data (!!!). The problem now is that an analyst can discover "meaningless" patterns. Statisticians call this Bonferroni's Principle. Roughly: if you look in more places for interesting patterns than your amount of data can support, you are bound to find patterns that mean nothing. Source: Rajaraman, Ullman: Mining of Massive Datasets
  11. We want to find (unrelated) people who have stayed at the same hotel on the same day at least twice.
     - 10^9 people are being tracked
     - 1000 days
     - Each person stays in a hotel 1% of the time (1 day out of 100)
     - Hotels hold 100 people (so 10^5 hotels)
     If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious? Expected number of "suspicious" pairs of people: 250,000. That is too many combinations to check; we need some additional evidence to find "suspicious" pairs of people in a more efficient way. Source: Rajaraman, Ullman: Mining of Massive Datasets
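
The figure of 250,000 can be reproduced with a few lines of arithmetic. The sketch below assumes the slide's parameters (10^9 people, 1000 days, a 1% chance of being in a hotel on a given day, 10^5 hotels) and uses the rough approximation n^2/2 for the number of pairs among n items; the variable names are illustrative only.

```python
from fractions import Fraction

people = 10**9                 # people being tracked
days = 1000                    # days of observation
p_in_hotel = Fraction(1, 100)  # chance a person is in a hotel on a given day
hotels = 10**5                 # 1% of 10^9 people per day, 100 people per hotel

# Chance that two given people are both in a hotel on a given day
p_both_in_hotel = p_in_hotel * p_in_hotel   # 10^-4
# ... and that they picked the same one of the 10^5 hotels
p_same_hotel = p_both_in_hotel / hotels     # 10^-9
# Chance that this happens on two given (different) days
p_twice = p_same_hotel ** 2                 # 10^-18

# Approximate number of pairs among n items as n^2 / 2
pairs_of_people = Fraction(people**2, 2)    # 5 * 10^17
pairs_of_days = Fraction(days**2, 2)        # 5 * 10^5

expected_suspicious_pairs = pairs_of_people * pairs_of_days * p_twice
print(expected_suspicious_pairs)  # 250000
```

Purely random behaviour therefore already yields about a quarter of a million "suspicious" pairs, which is the point of Bonferroni's Principle.
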
  12. As the Big Data concept is new, no specific standards are available yet. Big Data working groups and initiatives:
     - Open Data Center Alliance (ODCA)
     - TMF Big Data Analytics Reference Architecture
     - Research Data Alliance (RDA)
     - NIST Big Data Working Group (NBD-WG)
  13. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. IBM, Yahoo and Microsoft have their own products and technologies for Big Data. Hadoop was created by Doug Cutting and Mike Cafarella and was developed extensively at Yahoo.
  14. Hadoop is a scalable, reliable, fault-tolerant and simple software framework. Logically, Hadoop is a computing cluster that provides a storage layer and an execution layer.
     - Storage layer: Hadoop Distributed File System (HDFS). Runs on a regular OS file system such as Linux ext3. Fixed-size blocks, normally 64 MB in size, are replicated.
     - Execution layer: Hadoop MapReduce. Runs on many servers. A job consists of special "Map" and "Reduce" functions.
     Source: "A (very) short intro to Hadoop", Ken Krugler's talk at BigDataCamp, held in Washington DC, November 2011
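
To make the storage-layer description concrete, here is a small sketch of how a file maps onto fixed-size, replicated blocks. The 64 MB block size comes from the slide; the replication factor of 3 and the function name `block_usage` are assumptions made for illustration, not something the slide states.

```python
import math

BLOCK_SIZE_MB = 64  # block size quoted on the slide
REPLICATION = 3     # assumption: a commonly used replication factor


def block_usage(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, number of block replicas) for one file."""
    # The last block may be only partly filled, so round up.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication


blocks, replicas = block_usage(1000)
print(blocks, replicas)  # 16 48
```

Under these assumptions a 1000 MB file is split into 16 blocks, and the cluster stores 48 block replicas spread over its servers.
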
  15. Source: "A (very) short intro to Hadoop", Ken Krugler's talk at BigDataCamp, held in Washington DC, November 2011
  16. Google published a research paper describing MapReduce, a technology that can spread processing across hundreds of thousands of CPUs and so provide faster execution. It has two main functions, Map and Reduce. Map processes key/value pairs and produces a set of intermediate key/value pairs. Reduce combines all intermediate values that share a key and produces the merged output.
  17. MapReduce example:
     - Data collection: (Cust_id: A123, Amount: 500), (Cust_id: A123, Amount: 250), (Cust_id: B212, Amount: 200), (Cust_id: A223, Amount: 250)
     - Query (customers A123 and B212): (Cust_id: A123, Amount: 500), (Cust_id: A123, Amount: 250), (Cust_id: B212, Amount: 200)
     - Map (Cust_id with Amount): A123 {500, 250}, B212 {200}
     - Reduce (sum of Amount for given Cust_id): (Cust_id: A123, Amount: 750), (Cust_id: B212, Amount: 200)
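
The example on this slide can be sketched in plain Python. This is a single-process illustration of the map, group and reduce steps, not Hadoop code; the function names are hypothetical, and the customer IDs are taken as they appear in the data collection (A123 and B212).

```python
from collections import defaultdict

# (cust_id, amount) records from the slide's data collection
records = [("A123", 500), ("A123", 250), ("B212", 200), ("A223", 250)]
wanted = {"A123", "B212"}  # customers selected by the query


def map_phase(records, wanted):
    """Emit an intermediate (cust_id, amount) pair for each queried record."""
    for cust_id, amount in records:
        if cust_id in wanted:
            yield cust_id, amount


def group_by_key(pairs):
    """Collect all intermediate values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Merge the values of each key into a single sum."""
    return {key: sum(values) for key, values in grouped.items()}


grouped = group_by_key(map_phase(records, wanted))
totals = reduce_phase(grouped)
print(totals)  # {'A123': 750, 'B212': 200}
```

In a real cluster the map and reduce calls run on many servers and the grouping step is performed by the framework; here everything runs in one process to keep the data flow visible.
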
  18. Hive, Apache Mahout, Processing Big Data with MATLAB, Revolution R
  19. Hive is an SQL-like technology that sits on top of Hadoop clusters. Hive provides the Hive Query Language (HQL), which lets SQL developers write queries similar to SQL. HQL queries can be run from the Hive shell, or over JDBC/ODBC using drivers called Hive Thrift clients. Hive is based on Hadoop and MapReduce. The key difference between HQL and SQL is that Hadoop is intended for long sequential scans, so query latency can be in minutes.
  20. Apache Mahout is a scalable machine learning library. Uses of machine learning:
     - Generating recommendations based on previous clicks
     - Classifying DNA sequences
     - Bioinformatics, natural language processing
     A mahout is a person who keeps and drives an elephant. The name Mahout comes from the project's use of Apache Hadoop, which has a yellow elephant as its logo, for scalability and fault tolerance.
  21. Apache Mahout's algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. Mahout provides features that are very useful for business intelligence, such as collaborative filtering and clustering. Collaborative filtering (CF) is a technique, popularized by Amazon and others, that uses user information such as ratings, clicks and purchases to provide recommendations to other site users. Clustering is a technique for grouping data sets on a given condition, e.g. given all the news for a day from all newspapers across India, one might want to group all articles related to the same story automatically.
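
As a rough illustration of the collaborative filtering idea, the sketch below recommends items that were often bought together with items a user already owns. It is a toy in plain Python, not Mahout's API or algorithm; the purchase data, the function names and the co-occurrence scoring are all assumptions made for the example.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories: user -> set of items bought
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "carol": {"lamp", "mug"},
    "dave": {"book", "mug"},
}


def cooccurrence(purchases):
    """Count how often each ordered pair of items was bought by the same user."""
    counts = Counter()
    for items in purchases.values():
        for a, b in permutations(items, 2):
            counts[(a, b)] += 1
    return counts


def recommend(user, purchases, top_n=2):
    """Rank items the user does not own by co-occurrence with owned items."""
    owned = purchases[user]
    scores = Counter()
    for (a, b), n in cooccurrence(purchases).items():
        if a in owned and b not in owned:
            scores[b] += n
    # Sort by score (descending), then by name so the result is deterministic
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("bob", purchases))  # ['lamp', 'mug']
```

Here "lamp" ranks first for bob because it was bought together with both of his items, while "mug" was bought together with only one of them.
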
  22. MATLAB (Matrix Laboratory) is a numerical computing environment and fourth-generation programming language developed by MathWorks.
  23. Memory-mapped variables: these let you efficiently access big data sets on disk that are too large to hold in memory or that take too long to load. Intrinsic multicore math: many of the built-in mathematical functions in MATLAB, such as fft, inv and eig, are multithreaded. Cloud computing: you can run MATLAB computations in parallel using MATLAB Distributed Computing Server on Amazon's Elastic Compute Cloud (EC2) for on-demand parallel processing on hundreds or thousands of computers.
  24. R is a statistical analysis language developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. It is called "R" after the first initial of its developers' names. R can perform statistical and graphical analysis and provides clustering and classification on given data sets. R is an object-oriented programming language and is highly extensible, as users can submit packages for specific areas of interest.
  25. Revolution R is developed by a company called Revolution Analytics, which built an "open core" solution based on R. R works on the concept that all the data to be analyzed is held in memory, which is not possible for large data sets. Revolution R provides a new file format for large data sets, together with a parallel external-memory implementation and parallel algorithms for Big Data.
  26. As there is no standardization and data sets grow larger day by day, everybody is suggesting new solutions. The trend is to combine existing technologies and provide a new architecture. The situation is that we don't know what we could already know. Big Data is like a junction where multiple roads from very different directions intersect. Big Data is certainly the future, with new possibilities and opportunities.
  27. References:
     - Hsinchun Chen, Roger H. L. Chiang & Veda C. Storey (2012, December). MIS Quarterly, Vol. 36, 1165-1188.
     - Phillip Redman, John Girard & Leif-Olof Wallin (13 April 2011). Magic Quadrant for Mobile Device Management Software. Gartner Research, ID no: G00211101, 1-25.
     - Adam Jacobs (August 2009). The Pathologies of Big Data. Communications of the ACM, Vol. 52, No. 8, 36-44.
     - Jeffrey Dean & Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google Inc. research paper, OSDI 2004, 1-12.
     - Samet Ayhan, Johnathan Pesce, Paul Comitz, Gary Gerberick & Steve Bliesner. Predictive Analytics with Surveillance Big Data. 81-90.
     - Divyakant Agrawal, Sudipto Das & Amr El Abbadi. Big Data and Cloud Computing: Current State and Future. 530-533.
     - Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein & Caleb Welton. MAD Skills: New Analysis Practices for Big Data. 1481-1492.
     - IntroToHive.pdf (accessed on 02/10/2013)
     - (accessed on 02/10/2013)