Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Large-scale data processing analyzes and makes sense of large amounts of data. Spanning many fields, it brings together technologies like Distributed Systems, Machine Learning, Statistics, and the Internet of Things. It is a multi-billion-dollar industry with use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like the Internet of Things (IoT), these use cases are expanding to scenarios like Smart Cities, Smart Health, and Smart Agriculture. Some use cases, like urban planning, can tolerate slow results and are done in batch mode, while others, like stock markets, need results within milliseconds and are done in streaming fashion. Predictive analytics lets us learn models from data, often giving us the ability to predict the outcome of our actions.

The WSO2 Data Analytics platform is a fast and scalable platform used by more than 40 organizations, including banks, financial institutions, smart cities, hospitals, media companies, telecom companies, state and federal governments, and high-tech companies. This talk will start with a discussion of large-scale data analysis. Then we will look at the WSO2 Data Analytics platform and discuss in detail how we can use it to build end-to-end Big Data applications that combine the power of batch processing, real-time analytics, and predictive technologies.

Published in: Data & Analytics

  1. 1. Introduction to Large Scale Data Analysis and WSO2 Analytics Platform. Srinath Perera, Director, Research, WSO2; Apache Member (@srinath_perera), srinath@wso2.com. At Indiana University Bloomington
  2. 2. Who Are We? We are an open source middleware company - We build systems upon which others build their systems. Venture funded: Intel Capital, Cisco, Toba Capital. 400+ people and offices in Silicon Valley, Sri Lanka, London, and Bloomington. Customers include banks, aircraft manufacturers, governments (state and federal), media companies, telcos, retail, healthcare ..
  3. 3. Outline Introduction to Big Data The Problem we are trying to solve WSO2 Big Data Platform Next steps
  4. 4. A Day in Your Life. Think about a day in your life. - What is the best road to take? - Will there be any bad weather? - How should I invest my money? - How is my health? There are many decisions you could make better if only you could access the data and process it. http://www.flickr.com/photos/kcolwell/55124616 CC licence
  5. 5. Internet of Things. Currently the physical world and the software world are detached. The Internet of Things promises to bridge them - It is about sensors and actuators everywhere - In your fridge, in your blanket, in your chair, in your carpet.. Yes, even in your socks - Umbrellas that light up when there is rain, and medicine cups
  6. 6. What Can We Do with Big Data? Optimize (the world is inefficient) - 30% of food is wasted from farm to plate - GE Save 1% initiative (http://goo.gl/eYC0QE ) - Trains => 2B/year - US healthcare => 20B/year. Save lives - Weather, disease identification, personalized treatment. Technology advancement - Most high-tech research is done via simulations
  7. 7. Big Data Architecture
  8. 8. Big Data Processing Technologies Landscape
  9. 9. (Batch) Analytics. Scientists have been doing this for 25 years with MPI (1991) on special hardware - OpenMPI is being developed at IU! It took off with Google’s MapReduce paper (2004); Apache Hadoop, Hive, and a whole ecosystem were created. It was successful, so we are here!! But processing takes time.
  10. 10. Usecase: Targeted Advertising. Analytics implemented with MapReduce or queries - Min, max, average, correlation, histograms; might join or group data in many ways - Heatmaps, temporal trends. Key Performance Indicators (KPIs) - E.g. profit per square foot for retail
  11. 11. Usecase: Big Data for Development. Done using CDR (call detail record) data. People density, noon vs. midnight (red => increased, blue => decreased). Urban planning - People distribution - Mobility - Waste management - E.g. see http://goo.gl/jPujmM From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
  12. 12. Value of Some Insights Degrades Fast! For some usecases (e.g. stock markets, traffic, surveillance, patient monitoring), the value of insights degrades very quickly with time. - E.g. stock markets and the speed of light. We need technology that can produce outputs fast - Static queries, but with very fast output (alerts, realtime control) - Dynamic and interactive queries (data exploration)
  13. 13. Predictive Analytics. If we know how to solve a problem, that is, if we know a finite set of rules, then we can program it. For some problems (e.g. driving a car, character recognition), we do not know a fixed, finite rule set. Instead of programming, we give a lot of examples and ask the computer to learn (often called Machine Learning). Lots of tools - R (statistical language) - scikit-learn (Python) - Apache Spark’s MLBase and Apache Mahout (Java)
  14. 14. Usecase: Predictive Maintenance. The idea is to fix the problem before it happens, avoiding expensive downtime - Airplanes, turbines, windmills - Construction equipment - Cars, golf carts. How - Build a model of normal operation and measure deviation from it - Match against known error patterns
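A minimal sketch of the "model of normal operation, then measure deviation" idea, assuming a single numeric sensor and a simple mean/standard-deviation model. The function names, the sample vibration readings, and the 3-sigma threshold are illustrative assumptions, not part of the talk or of any WSO2 API.

```python
from statistics import mean, stdev


def fit_normal_model(readings):
    """Summarize normal operation by its mean and sample standard deviation."""
    return mean(readings), stdev(readings)


def is_anomalous(value, model, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from normal."""
    mu, sigma = model
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical vibration readings collected while the machine was healthy.
normal_vibration = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 0.98]
model = fit_normal_model(normal_vibration)

print(is_anomalous(1.02, model))  # close to normal operation
print(is_anomalous(2.0, model))   # far from normal operation
```

Real deployments usually need a richer model (multiple sensors, seasonality), but the compare-against-normal structure stays the same.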
  15. 15. Problem We Are Trying to Solve! Build a platform with which others can build their analytics systems - Collect, Analyze, Communicate - End to end, starts from humans and ends with humans. Different audiences - Technical (developers) - Non-technical (CXOs, sales, analysts). "There are two things you need to know about business: make something users love and make more than you spend." -- Paul Graham (Lisp, Y Combinator)
  16. 16. Running Example. Monitor temperature and hot airflow across multiple buildings (e.g. central AC) - More people => hotter. Analytics - Historical behavior of temperature by the hour - Alerts if the temperature falls too low or rises too high - Modeling and predicting temperature to adjust proactively
        define TemperatureStream(ts long, buildingNo long, t double);
        define AirflowStream(ts long, buildingNo long, aflow double, aT);
  17. 17. Collect Data. One Sensor API to publish events - REST, Thrift, Java, JMS, Kafka - Java clients, JavaScript clients*. First you define streams (think of a stream as an infinite table in a SQL DB). Then send events via the API. * Challenges (performance, guaranteed delivery, scale). Can send to the batch pipeline, the realtime pipeline, or both via configuration!
  18. 18. Collecting Data: Example. Java example: create and send events. Events are sent asynchronously. The code has three steps: define the stream, initialize the stream, and send events. See the client given in http://goo.gl/vIJzqc for more info.
        Agent agent = new Agent(agentConfiguration);
        publisher = new AsyncDataPublisher("tcp://hostname:7612", .. );
        StreamDefinition definition = new StreamDefinition(STREAM_NAME, VERSION);
        definition.addPayloadData("sid", STRING);
        ...
        publisher.addStreamDefinition(definition);
        ...
        Event event = new Event();
        event.setPayloadData(eventData);
        publisher.publish(STREAM_NAME, VERSION, event);
  19. 19. Batch Analytics: Spark. Two frameworks: Hadoop (http://hadoop.apache.org ) and Spark (https://spark.apache.org ) - Hadoop is a MapReduce implementation. Spark is faster (30X) and much more flexible. Spark set a record in the Gray Sort benchmark (100 TB): 3X faster with 10X fewer machines, http://goo.gl/r5LGvD. For Hadoop and MapReduce resources, Google it.
        file = spark.textFile("hdfs://...")
        file.flatMap(tsToHourFunction).reduceByKey(lambda a, b: a+b)
  20. 20. SQL-like Queries: Hive. Apache Hive provides a SQL-like data processing language. Since many people understand SQL, Hive made large-scale data processing accessible to many. Expressive, short, and sweet. Defines core operations that cover 90% of problems. Lets experts dig in when they like (via user-defined functions)!
  21. 21. Hourly Temperature Average. Hive compiles the SQL-like query into a set of MapReduce jobs running in Hadoop or Spark (in WSO2 BAM from the 2015 Q2 release).
        insert overwrite table TemperatureHistory
        select getHour(ts) as hour, avg(t) as avgT, buildingNo
        from TemperatureStream
        group by buildingNo, getHour(ts);
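To make the group-by in the hourly average query concrete, here is a small in-memory Python sketch of the same aggregation. It assumes `ts` is epoch milliseconds interpreted in UTC and that events are `(ts, buildingNo, t)` tuples; `get_hour` and `hourly_average` are hypothetical helper names, not Hive or WSO2 functions.

```python
from collections import defaultdict
from datetime import datetime, timezone

HOUR_MS = 3600 * 1000


def get_hour(ts_ms):
    """Hour of day (0-23) for an epoch-millisecond timestamp, assuming UTC."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).hour


def hourly_average(events):
    """Average temperature per (building, hour), like the group-by query."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts, building, t in events:
        acc = sums[(building, get_hour(ts))]
        acc[0] += t
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


events = [
    (0, 1, 20.0),        # building 1, hour 0
    (1000, 1, 22.0),     # building 1, hour 0
    (HOUR_MS, 1, 30.0),  # building 1, hour 1
    (500, 2, 18.0),      # building 2, hour 0
]
result = hourly_average(events)
print(result)
```

Hive or Spark perform the same per-key accumulation, but spread across many machines.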
  22. 22. Complex Event Processing
  23. 23. Operators: Filters. Assume a temperature stream. Here weather:convertFtoC() is a user-defined function; such functions are used to extend the language.
        define stream TemperatureStream(ts long, roomNo long, temp double);
        from TemperatureStream[weather:convertFtoC(temp) > 30.0 and roomNo != 2043]
        select roomNo, temp
        insert into HotRoomsStream ;
      Usecases: - Alerts, thresholds (e.g. alarm on high temperature) - Preprocessing: filtering, transformations (e.g. data cleanup)
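The filter operator can be pictured as a generator that passes through only matching events. The sketch below mirrors the query above in plain Python; it assumes each event is a dict carrying `roomNo` and a Fahrenheit `temp`, and `convert_f_to_c` stands in for the user-defined function. None of these names are CEP APIs.

```python
def convert_f_to_c(temp_f):
    """Stand-in for the weather:convertFtoC() user-defined function."""
    return (temp_f - 32.0) * 5.0 / 9.0


def hot_rooms(temperature_stream):
    """Yield (roomNo, temp) for events hotter than 30 C, skipping room 2043."""
    for event in temperature_stream:
        if convert_f_to_c(event["temp"]) > 30.0 and event["roomNo"] != 2043:
            yield {"roomNo": event["roomNo"], "temp": event["temp"]}


stream = [
    {"ts": 1, "roomNo": 101, "temp": 95.0},    # 35 C: kept
    {"ts": 2, "roomNo": 102, "temp": 68.0},    # 20 C: dropped
    {"ts": 3, "roomNo": 2043, "temp": 104.0},  # hot, but excluded room
]
hot = list(hot_rooms(stream))
print(hot)
```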
  24. 24. Operators: Windows and Aggregation. Supports many window types - Batch windows, sliding windows, custom windows. Usecases - Simple counting (e.g. failure count) - Counting with windows (e.g. failure count every hour)
        from TemperatureStream#window.time(1 min)
        select roomNo, avg(temp) as avgTemp
        insert into HotRoomsStream ;
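A sliding time window keeps only the events from the last N milliseconds and recomputes the aggregate as events arrive and expire. The following is a rough Python sketch of that idea, assuming timestamps are in milliseconds and arrive in non-decreasing order; the class name is hypothetical and this is not how the CEP engine is implemented.

```python
from collections import deque


class SlidingTimeWindowAverage:
    """Running average over events from the last `window_ms` milliseconds."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.events = deque()  # (ts, temp) pairs currently inside the window
        self.total = 0.0

    def add(self, ts, temp):
        """Add an event, expire old ones, and return the current average."""
        self.events.append((ts, temp))
        self.total += temp
        # Drop events that fell out of the interval (ts - window_ms, ts].
        while self.events[0][0] <= ts - self.window_ms:
            _, old_temp = self.events.popleft()
            self.total -= old_temp
        return self.total / len(self.events)


window = SlidingTimeWindowAverage(window_ms=60_000)  # 1 minute
first = window.add(0, 20.0)
second = window.add(30_000, 30.0)
third = window.add(70_000, 40.0)  # the event at ts=0 has expired by now
print(first, second, third)
```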
  25. 25. Operators: Patterns. Models a "followed by" relation: e.g. event A followed by event B. A very powerful tool for tracking and detecting patterns.
        from every (a1 = TemperatureStream) -> a2 = TemperatureStream[temp > a1.temp + 5]
            within 1 day
        select a2.ts as ts, a2.temp - a1.temp as diff
        insert into HotDayAlertStream;
      Usecases - Detecting event sequence patterns - Tracking - Detecting trends
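The "followed by" pattern above can be approximated in plain Python by remembering every earlier event until a later event exceeds it by more than 5 degrees or the one-day limit passes. This is an illustrative sketch only: it assumes time-ordered `(ts, temp)` tuples with millisecond timestamps, and it may differ from the engine's exact matching semantics. `detect_rises` is a hypothetical name.

```python
DAY_MS = 24 * 3600 * 1000


def detect_rises(events, rise=5.0, within_ms=DAY_MS):
    """Return (ts, diff) for each earlier event followed by a hotter one."""
    alerts = []
    pending = []  # earlier events still waiting for a matching follower
    for ts, temp in events:
        still_pending = []
        for ts1, temp1 in pending:
            if ts - ts1 > within_ms:
                continue  # too old: the "within 1 day" limit has passed
            if temp > temp1 + rise:
                alerts.append((ts, temp - temp1))  # pattern completed
            else:
                still_pending.append((ts1, temp1))
        still_pending.append((ts, temp))  # "every": each event starts a match
        pending = still_pending
    return alerts


alerts = detect_rises([(0, 20.0), (1000, 22.0), (2000, 26.0), (3000, 28.0)])
print(alerts)
```

The linear scan over pending events is fine for a sketch; a production engine would use state machines and indexing to keep matching cheap.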
  26. 26. Operators: Joins. Join two data streams based on a condition and windows. Usecases - Data correlation, detecting missing events, detecting erroneous data - Joining event streams
        from TemperatureStream[temp > 30.0]#window.time(1 min) as T
            join RegulatorStream[isOn == false]#window.length(1) as R
            on T.roomNo == R.roomNo
        select T.roomNo, R.deviceID, 'start' as action
        insert into RegulatorActionStream;
  27. 27. Operators: Access Data from the Disk. Event tables allow users to map a database to a window and join a data stream with the window. Usecases - Merge with data in a database, collect, update data conditionally
        define table HistTempTable(day long, avgT double);
        from TemperatureStream#window.length(1) join HistTempTable
            on getDayOfYear(ts) == HistTempTable.day and temp > avgT
        select ts, temp
        insert into PurchaseUserStream ;
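Joining a stream against an event table amounts to a lookup per incoming event. The sketch below models the table as a Python dict from day of year to historical average temperature and keeps events hotter than that average. It assumes epoch-millisecond timestamps in UTC and that the intended comparison is temperature against the historical average; the function names are hypothetical.

```python
from datetime import datetime, timezone

DAY_MS = 24 * 3600 * 1000


def day_of_year(ts_ms):
    """Day of year (1-366) for an epoch-millisecond timestamp, assuming UTC."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).timetuple().tm_yday


def above_historical(events, hist_temp_table):
    """Yield (ts, temp) events hotter than the stored average for that day."""
    for ts, temp in events:
        avg_t = hist_temp_table.get(day_of_year(ts))
        if avg_t is not None and temp > avg_t:
            yield (ts, temp)


hist_temp_table = {1: 25.0, 2: 26.0}  # day of year -> historical average
events = [
    (0, 27.0),           # day 1, above 25.0: kept
    (1000, 24.0),        # day 1, below 25.0: dropped
    (DAY_MS, 30.0),      # day 2, above 26.0: kept
    (2 * DAY_MS, 40.0),  # day 3, no table row: dropped
]
matches = list(above_historical(events, hist_temp_table))
print(matches)
```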
  28. 28. Realtime Analytics Patterns. Simple counting (e.g. failure count). Counting with windows (e.g. failure count every hour). Preprocessing: filtering, transformations (e.g. data cleanup). Alerts, thresholds (e.g. alarm on high temperature). Data correlation, detecting missing events, detecting erroneous data (e.g. detecting failed sensors). Joining event streams (e.g. detecting a hit on a soccer ball). Merge with data in a database, collect, update data conditionally.
  29. 29. Realtime Analytics Patterns (contd.). Detecting event sequence patterns (e.g. a small transaction followed by a large transaction). Tracking - following some related entity’s state in space, time, etc. (e.g. location of airline baggage, vehicles, tracking wildlife). Detecting trends - rise, turn, fall, outliers, complex trends like triple bottom, etc. (e.g. algorithmic trading, SLA, load balancing). Learning a model (e.g. predictive maintenance). Predicting the next value and corrective actions (e.g. automated car).
  30. 30. Predictive Analytics. Build models and use them with WSO2 CEP, BAM, and ESB using the upcoming WSO2 Machine Learner product (2015 Q2). Build models using R, export them as PMML, and use them within WSO2 CEP. Call R scripts from CEP queries. Regression and anomaly detection operators in CEP.
  31. 31. Predictive Analytics. WSO2 Machine Learner provides a wizard to explore data and build models. E.g. build a model to predict the temperature for the next 15 minutes - Trivial option: (historical mean + last 15m mean)/2 - Better model via ARIMA from time series analysis. To know more, take an ML class.
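The trivial option above is just the average of two means. A minimal Python sketch, assuming "historical mean" is computed over whatever past readings are considered comparable (the slide does not say which) and using made-up sample values:

```python
from statistics import mean


def baseline_forecast(historical_temps, last_15m_temps):
    """Trivial predictor: (historical mean + last 15 minute mean) / 2."""
    return (mean(historical_temps) + mean(last_15m_temps)) / 2


# Hypothetical readings: historical mean 22.0, last 15 minute mean 27.0.
forecast = baseline_forecast([20.0, 22.0, 24.0], [26.0, 28.0])
print(forecast)  # (22.0 + 27.0) / 2
```

A baseline like this is mainly useful as a yardstick: a time series model such as ARIMA should beat it to be worth the extra complexity.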
  32. 32. Communicate: Dashboards. The idea is to give the “overall idea” at a glance (e.g. a car dashboard). Support for personalization: you can build your own dashboard. Also the entry point for drill-down. How to build? - Dashboard via Google Gadgets and content via HTML5 + JavaScript - Use WSO2 User Engagement Server to build a dashboard (or a JSP or PHP) - Use charting libraries like Vega or D3
  34. 34. Communicate: Alerts. Detecting conditions can be done via CEP queries. The key is the “last mile” - Email - SMS - Push notifications to a UI - Pager - Trigger a physical alarm. How? - Select the Email sender “Output Adaptor” from CEP, or send from CEP to the ESB, which has many connectors
  35. 35. Communicate: APIs. With mobile apps, most data is exposed and shared with end users as APIs (REST/JSON). Some challenges - Security and permissions - API discovery - Billing, throttling, quotas - SLA enforcement. How? - Write data to a database from CEP event tables - Build services via WSO2 Data Service - Expose them as APIs via API Manager
  36. 36. Smart Home 2015 yearly DEBS (Distributed Event Based Systems) DEBS Grand Challenge (http://goo.gl/0htxlj) Smart Home electricity data: 2000 sensors, 40 houses, 4 Billion events We posted (400K events/sec) and close to one million distributed throughput with 4 nodes. WSO2 CEP based solution is one of the four finalists (with Dresden University of Technology, Fraunhofer Institute, and Imperial College London) Only generic solution to become a finalist
  37. 37. Case Study: Realtime Soccer Analysis Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
  38. 38. Case Study: TFL Traffic Analysis. Built using TFL (Transport for London) open data feeds. http://goo.gl/04tX6k http://goo.gl/9xNiCm
  39. 39. WSO2 Big Data Analytics Platform
  40. 40. Conclusion Goal: Build a platform using which others can build their analytics systems - End to end, starts from humans and ends with humans Whole platform is opensource under Apache License What can you do with the platform? - Solve hard problems, build Great Apps with the platform - Add and contribute extensions to the platform (e.g. GSoc http://goo.gl/QNFP6Y ) - Fix problems ( Patches) Find us at architecture@wso2.org list or Stackoverflow (tag wso2)
  41. 41. Questions?
