
Big Data Analysis : Deciphering the haystack


A primary outcome of Big Data is to derive useful and actionable insights from large and challenging data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights: from calculating simple analytics like mean, max, and median, to deriving an overall understanding of the data by building models, and finally to deriving predictions from the data. In some cases we can afford to wait while data is collected and processed, while in other cases we need the outputs right away. MapReduce has been the de facto standard for data processing, and we will start our discussion there. However, that is only one side of the problem. Other technologies like Apache Spark and Apache Drill are gaining ground, as are real-time processing technologies like Stream Processing and Complex Event Processing. Finally, there is a lot of work on porting decision technologies like machine learning into the big data landscape. This talk discusses big data processing in general and looks at each of these technologies, comparing and contrasting them.

Published in: Technology


  1. Outline • Big Data • Big Data Processing – To know – To explain – To forecast • Conclusion. Photo by John Trainor on Flickr, licensed under CC
  2. A day in your life • Think about a day in your life – What is the best road to take? – Will there be any bad weather? – How should I invest my money? – How is my health? • There are many decisions you could make better if only you could access the data and process it. Photo licensed under CC
  3. Data Avalanche/ Moore's law of data • We are now collecting and converting large amounts of data to digital form • 90% of the data in the world today was created within the past two years • The amount of data we have doubles very fast
  4. Internet of Things • Currently the physical world and the software world are detached • The Internet of Things promises to bridge this – It is about sensors and actuators everywhere – In your fridge, in your blanket, in your chair, in your carpet… yes, even in your socks – Google IO pressure mats
  5. What can we do with Big Data? • Optimize – A 1% saving in airplanes and turbines can save more than 1B$ each year (GE talk, Strata 2014); for comparison, Sri Lanka's GDP is about 60B$ • Save lives – Weather, disease identification, personalized treatment • Technology advancement/ understanding – Most high-tech work is done via simulations
  6. Big Data Architecture
  7. Why is Big Data hard? • How to store? Assuming 1TB disks, it takes 1,000 computers to store 1PB • How to move? Assuming a 10Gb/s network, it takes about 13 minutes to copy 1TB and over 9 days to copy 1PB • How to search? Assuming each record is 1KB and one machine can process 1,000 records per second, it needs about 11 CPU days to process 1TB and over 30 CPU years to process 1PB • How to process? – How to convert algorithms to work at large scale – How to create new algorithms
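The estimates above follow from a few lines of arithmetic. The script below checks them under the slide's stated assumptions, using decimal units (1TB = 10^12 bytes, 1PB = 10^15 bytes).

```python
# Sanity-check the storage, transfer, and scan estimates
# (decimal units: 1 TB = 10**12 bytes, 1 PB = 10**15 bytes).
TB, PB = 10**12, 10**15

machines = PB // TB                          # 1 TB disks needed to store 1 PB

link = 10 * 10**9 / 8                        # 10 Gb/s link in bytes per second
copy_tb_min = TB / link / 60                 # minutes to copy 1 TB
copy_pb_days = PB / link / 86400             # days to copy 1 PB

records_tb = TB // 1000                      # 1 KB records in 1 TB
scan_tb_days = records_tb / 1000 / 86400     # at 1,000 records per second
scan_pb_years = records_tb * 1000 / 1000 / 86400 / 365
```

Running it gives roughly 1,000 machines, 13 minutes per TB, 9.3 days per PB, 11.6 CPU days to scan a TB, and about 32 CPU years for a PB.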
  8. Why is it hard? (Contd.) • Systems built of many computers • That handle lots of data • Running complex logic • This pushes us to the frontier of Distributed Systems and Databases • More data does not mean there is a simple model • Some models can be as complex as the system. Licensed under CC
  9. Tools for Processing Data
  10. Big Data Processing Technologies Landscape
  11. MapReduce/ Hadoop • First introduced by Google, and used as the processing model for their architecture • Implemented by open-source projects like Apache Hadoop and Spark • Users write two functions: map and reduce • The framework handles details like distributed processing, fault tolerance, and load balancing • Widely used, and one of the catalysts of Big Data
      void map(ctx, k, v){
        tokens = v.split();
        for t in tokens
          ctx.emit(t, 1);
      }
      void reduce(ctx, k, values[]){
        count = 0;
        for v in values
          count = count + v;
        ctx.emit(k, count);
      }
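As a rough illustration, the word-count pseudocode above can be simulated in-process with plain Python. The `run_mapreduce` helper and its `emit` callbacks below stand in for the framework's shuffle and grouping; they are a sketch of the programming model, not Hadoop's actual API.

```python
from collections import defaultdict

def map_fn(key, value, emit):
    # Map phase: emit (token, 1) for every word in the input line
    for token in value.split():
        emit(token, 1)

def reduce_fn(key, values, emit):
    # Reduce phase: sum the counts collected for each word
    emit(key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)                  # the "shuffle" phase
    for k, v in records:
        map_fn(k, v, lambda mk, mv: groups[mk].append(mv))
    out = {}
    for k, vs in groups.items():
        reduce_fn(k, vs, lambda rk, rv: out.__setitem__(rk, rv))
    return out

counts = run_mapreduce([(1, "to be or not to be")], map_fn, reduce_fn)
```

Here `counts` ends up as `{'to': 2, 'be': 2, 'or': 1, 'not': 1}`; in a real cluster the same map and reduce logic would run across many machines.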
  12. MapReduce (Contd.)
  13. Histogram of Speeds
  14. Histogram of Speeds (Contd.)
      void map(ctx, k, v){
        event = parse(v);
        range = calculateRange(event.v);
        ctx.emit(range, 1);
      }
      void reduce(ctx, k, values[]){
        count = 0;
        for v in values
          count = count + v;
        ctx.emit(k, count);
      }
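The binning step can be sketched directly in Python. The `calculate_range` helper and the 10-unit bin width are assumptions for illustration; the slide does not specify the actual speed ranges.

```python
from collections import Counter

def calculate_range(speed, bin_width=10):
    # Map each speed into its histogram bin, e.g. 12.5 -> (10, 20)
    low = int(speed // bin_width) * bin_width
    return (low, low + bin_width)

speeds = [3.2, 7.8, 12.5, 14.0, 27.3]          # made-up sensor readings
histogram = Counter(calculate_range(s) for s in speeds)
```

The `Counter` plays the role of the reduce phase, summing a 1 for every event that falls in a bin.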
  15. Hadoop Landscape • Hive – Query data using SQL-style queries; Hive converts them to MapReduce jobs and runs them in Hadoop • Pig – Write programs using data-flow-style scripts; Pig converts them to MapReduce jobs and runs them in Hadoop • Mahout – A collection of MapReduce jobs implementing many data mining and artificial intelligence algorithms
      A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
      B = FILTER A BY gni > 2000;
      C = ORDER B BY gni;
      dump C;
      hive> SELECT country, gni from HDI WHERE gni > 2000;
  16. Is Hadoop Enough? • Limitations – Takes time for processing – Lack of incremental processing – Weak with graph use cases – Not very easy to create a processing pipeline (addressed with Hive, Pig etc.) – Too close to programmers – Faster implementations are possible • Alternatives – Apache Drill – Spark – Graph processors
  17. Spark • A new programming model built on functional programming concepts • Can be much faster for iterative use cases • Has a complete stack of products
      file = spark.textFile("hdfs://...")
      file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
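To show the functional model without a cluster, here is the same flatMap → map → reduceByKey shape rendered in plain Python. This is not Spark's API, only a sketch of the transformation chain it expresses.

```python
from collections import Counter
from itertools import chain

lines = ["hello world", "hello spark"]          # stand-in for the HDFS file

# flatMap: split every line into words, flattening the result
flat_mapped = chain.from_iterable(line.split(" ") for line in lines)

# map: pair each word with a count of 1
mapped = ((word, 1) for word in flat_mapped)

# reduceByKey(_ + _): sum the counts per word
reduced = Counter()
for word, n in mapped:
    reduced[word] += n
```

The chain reads top to bottom exactly like the Scala pipeline; Spark's contribution is running each stage distributed and lazily.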
  18. Real-time Analytics • The idea is to process data as it is received, in a streaming fashion • Used when we need – Very fast output – To handle lots of events (a few 100k to millions) – Processing without storing (e.g. too much data) • Two main technologies – Stream Processing (e.g. Storm) – Complex Event Processing (CEP)
  19. Complex Event Processing (CEP) • Sees inputs as event streams, queried with an SQL-like language • Supports filters, windows, joins, patterns and sequences
      from p=PINChangeEvents#win.time(3600)
        join t=TransactionEvents[p.custid=custid][amount>10000]#win.time(3600)
      return t.custid, t.amount;
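The windowed-join query above can be approximated in a few lines of plain Python: flag transactions over 10,000 that occur within one hour of a PIN change for the same customer. The event tuple shapes and sample data below are illustrative assumptions, not part of any CEP engine's API.

```python
WINDOW = 3600  # one hour, matching #win.time(3600)

def detect_fraud(pin_changes, transactions):
    # pin_changes: (timestamp, custid); transactions: (timestamp, custid, amount)
    alerts = []
    for t_time, custid, amount in transactions:
        if amount <= 10000:
            continue
        for p_time, p_custid in pin_changes:
            if p_custid == custid and 0 <= t_time - p_time <= WINDOW:
                alerts.append((custid, amount))
                break
    return alerts

alerts = detect_fraud(
    pin_changes=[(100, "c1"), (200, "c2")],
    transactions=[(500, "c1", 20000), (600, "c2", 50), (9999, "c1", 15000)],
)
```

A CEP engine evaluates the same logic incrementally as events arrive, expiring events that fall out of the window, rather than scanning stored lists.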
  20. DEBS Challenge • Event processing challenge • Real football game, sensors in player shoes + ball • Events at 15kHz • Event format – Sensor ID, TS, x, y, z, v, a • Queries – Running stats – Ball possession – Heat map of activity – Shots at goal
  21. Example: Detect Ball Possession • Possession is the time from when a player hits the ball until someone else hits it or it goes out of the ground
      from Ball#window.length(1) as b
        join Players#window.length(1) as p unidirectional
        on debs:getDistance(b.x,b.y,b.z, p.x,p.y,p.z) < 1000 and b.a > 55
      select ... insert into hitStream
      from old = hitStream, b = hitStream[old.pid != pid], n = hitStream[ == pid]*,
        (e1 = hitStream[ != pid] or e2 = ballLeavingHitStream)
      select ... insert into BallPossessionStream
  22. Information Visualization • Presenting information – To end users – To decision makers – To scientists • Interactive exploration • Sending alerts
  23. Napoleon's Russian Campaign
  24. Hans Rosling's Data • This uses a tool called "Gapminder"
  25. GNU Plot
      set terminal png
      set output "buyfreq.png"
      set title "Frequency Distribution of Items Bought by Buyer";
      set ylabel "Number of Items Bought";
      set xlabel "Buyers Sorted by Item Count";
      set key left top
      set log y
      set log x
      plot "" using 2 title "Frequency" with linespoints
    • Open source • Very powerful
  26. Web Visualization Libraries • D3, Flot, etc. • Can generate compelling graphs
  27. Solving the Problem
  28. Making Sense of Data • To know (what happened?) – Basic analytics + visualizations (min, max, average, histogram, distributions …) – Interactive drill-down • To explain (why?) – Data mining, classification, building models, clustering • To forecast – Neural networks, decision models
  29. To Know (What Happened/ Is Happening?) • Analytics implemented with MapReduce or queries – Min, max, average, correlation, histograms – Might join or group data in many ways • Data is often presented with visualizations • Support for drill-down • Real-time analytics to track things fast enough, and alerts
  30. Analytics/ Summary • Min, max, mean, standard deviation, correlation, grouping, etc. • E.g. Internet access logs – Number of hits: total, by day or hour, trend over time – File size distribution – Request success-to-failure ratio – Hits by client IP region. Image by Steven Iodice
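The access-log analytics listed above reduce to simple aggregates. A sketch over toy records, whose (path, bytes, status) layout is assumed purely for illustration:

```python
import statistics
from collections import Counter

# Toy access-log records: (path, response bytes, HTTP status)
records = [
    ("/index.html", 1200, 200),
    ("/index.html", 1200, 200),
    ("/data.csv", 50000, 200),
    ("/missing", 300, 404),
]

sizes = [size for _, size, _ in records]
summary = {
    "hits": len(records),
    "total_bytes": sum(sizes),
    "mean_bytes": statistics.mean(sizes),
    "stdev_bytes": statistics.stdev(sizes),       # file size spread
    "success_ratio": sum(1 for *_, s in records if s == 200) / len(records),
    "hits_by_path": Counter(path for path, *_ in records),
}
```

At big data scale the same aggregates are what the MapReduce jobs and queries of the previous slides compute, grouped by day, hour, or client region.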
  31. Drill Down • The idea is to let users drill down and explore a view of the data (called BI or OLAP) • Users define 3 dimensions (or more), and the tool lets them explore the cube, looking at only a subset of the data • E.g. tool: Mondrian
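An OLAP-style cube can be mimicked with a small roll-up helper: aggregate a measure over chosen dimensions, then "drill down" by asking for a finer grouping. The sales rows and dimension names below are invented for illustration.

```python
from collections import defaultdict

# Rows: (year, country, product, units_sold) -- last field is the measure
sales = [
    ("2014", "US", "laptop", 3),
    ("2014", "US", "phone", 5),
    ("2014", "UK", "phone", 2),
    ("2015", "US", "laptop", 4),
]

def roll_up(rows, dims):
    # Sum the measure over the selected dimension indices
    cube = defaultdict(int)
    for row in rows:
        *keys, measure = row
        cube[tuple(keys[d] for d in dims)] += measure
    return dict(cube)

by_year = roll_up(sales, dims=[0])              # coarse view
by_year_country = roll_up(sales, dims=[0, 1])   # one drill-down step
```

Tools like Mondrian precompute or cache many such roll-ups so the user can move between levels interactively.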
  32. Search • Process and index the data: the killer app of our time • Web search • Graph search • Semantic search • Drug discovery • …
  33. Search: New Ideas • Apache Drill, BigQuery • BlinkDB: search via sampling
  34. To Explain (Patterns) • Understanding underlying patterns in data • Correlation – Scatter plots, regression • Data mining (detecting patterns) – Clustering – Dimension reduction – Finding similar items – Finding hubs and authorities in a graph – Finding frequent item sets – Making recommendations
  35. Usecase 1: Clustering • Cluster by location • Cluster by similarity • Cluster to identify cliques (e.g. fast information distribution, forensics) • E.g. earthquake data
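A bare-bones k-means pass over 2D points (e.g. earthquake epicentres) shows what the clustering step actually does; the fixed starting centroids are an assumption to keep the run deterministic.

```python
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for x, y in points:
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[dists.index(min(dists))].append((x, y))
        # Update step: move each centroid to its cluster's mean
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

The two tight groups in the sample data end up in separate clusters; Mahout's clustering jobs run the same assignment/update loop as MapReduce passes over distributed data.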
  36. Usecase 2: Regression • Find the relationship between variables • E.g. find the trend • E.g. the relationship between money spent and education level improvement
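The trend-finding case reduces to ordinary least squares for a line y = a·x + b. A self-contained sketch, with invented sample numbers:

```python
def fit_line(xs, ys):
    # Closed-form least squares: slope = Sxy / Sxx, intercept from the means
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# E.g. money spent (x) against a roughly linear outcome (y)
slope, intercept = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.0])
```

For this data the fitted slope is about 1.98, i.e. each extra unit spent buys roughly two units of outcome, which is the "trend" the slide refers to.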
  37. Tools • Weka – Java machine learning library • Mahout – MapReduce implementations of machine learning algorithms • R – programming language for statistical computing • PMML (Predictive Model Markup Language) • Scaling these libraries is a challenge
  38. Forecasts and Models • Trying to build a model for the data, theoretically or empirically – Analytical models (e.g. physics, but PDEs kill you) – Simulations (numerical methods) – Regression models – Time series models – Neural network models – Reinforcement learning – Classifiers (dimensionality reduction, kernel methods) • Examples – Translation – Weather forecast models – Building profiles of users – Traffic models – Economic models
  39. Usecase 1: Modeling the Solar System • PDEs not solvable analytically • Simulations • Other examples: weather
  40. Usecase 2: Electricity Demand Forecast • Time series • Moving averages • Cycles
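The moving average named above is the baseline forecaster for such a series: predict the next value as the mean of the last few observations. The demand numbers below are invented for illustration.

```python
def moving_average(series, window):
    # Mean of each trailing window of `window` observations
    return [sum(series[i - window:i]) / window
            for i in range(window, len(series) + 1)]

demand = [100, 102, 98, 110, 104, 108]   # e.g. daily demand readings
smoothed = moving_average(demand, window=3)
forecast = smoothed[-1]                  # naive prediction for the next period
```

Real demand forecasting layers cycle (daily/seasonal) terms on top of this smoothing, as the slide's "Cycles" bullet hints.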
  41. Usecase 3: Targeted Marketing • Collect data and build a model of – What the user likes – What he has bought – What his buying power is • Making recommendations • Giving personalized deals
  42. Big Data Lifecycle • Realizing the big data lifecycle is hard • It needs a wide understanding of many fields • Big data teams will include members from many fields working together
  43. Data Scientists • A new role with an evolving job description • The job is to make sense of data using big data processing, advanced algorithms, statistical methods, data mining, etc. • Likely to be among the highest paid/coolest jobs • Big organizations (Google, Facebook, banks) are hiring Math PhDs in large numbers for this • A data scientist draws on statisticians, system experts, domain experts, DB experts, and AI experts
  44. Challenges • Fast, scalable real-time analytics • Merging real-time and batch processing • Good libraries for machine learning • Scaling decision models, e.g. machine learning
  45. Conclusions • Big Data: how, why, and Big Data architecture • Data processing tools – MapReduce, CEP, visualizations • Solving problems – To know what happened – To explain what happened – To predict what will happen
  46. Questions?