Data Science in the Real World: Making a Difference


We use the terms “Big Data” and “Data Science” for the use of data processing to make sense of the world around us. Spanning many fields, Big Data brings together technologies such as Distributed Systems, Machine Learning, Statistics, and the Internet of Things. It is a multi-billion-dollar industry with use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like the Internet of Things (IoT), these use cases are expanding to scenarios like Smart Cities, Smart Health, and Smart Agriculture.

These use cases rely on basic analytics, advanced statistical methods, and predictive technologies like Machine Learning. However, it is not just about crunching the data. Some use cases, like urban planning, move slowly, so there is enough time to process the data. With use cases like traffic, patient monitoring, and surveillance, however, the value of results degrades much faster with time, and results are needed within milliseconds to seconds. Collecting data from many sources, cleaning it up, processing it on computation clusters, and doing all of this fast is a major challenge.

This talk will discuss the motivation behind big data and data science and how they can make a difference. It will then discuss the challenges, systems, and methodologies for implementing and sustaining a data science pipeline.

Published in: Data & Analytics

  1. 1. Data Science in the Real World: Making a Difference Srinath Perera Director Research WSO2, Apache Member (@srinath_perera) StatDay 2015 @ University of Colombo
  2. 2. Outline - Making sense of the World’s Data - Building Data Systems - Changing Dynamics of Data Analysis with Big Data (Sensor Data) - Challenges and Open Problems
  3. 3. Michael Stonebraker “But then, out of nowhere, some marketing guys started talking about ‘big data.’ That’s when I realized that I’d been studying this thing for the better part of my academic life.”
  4. 4. Michael Stonebraker “But then, out of nowhere, some marketing guys started talking about ‘big data.’ That’s when I realized that I’d been studying this thing for the better part of my academic life.” ACM Turing Award,
  5. 5. A Day in Your Life Think about a day in your life. - What is the best road to take? - Will there be any bad weather? - How should I invest my money? - How is my health? There are many decisions that you could make better if only you could access the data and process it. 1652/ CC licence
  6. 6. What Can We Do with Data? Optimize (the world is inefficient) - 30% of food is wasted from farm to plate - GE Save 1% initiative ( ) - Trains => 2B/year - US healthcare => 20B/year Save lives - Weather, disease identification, personalized treatment Technology advancement - Most high-tech research is done via simulations
  7. 7. Building Data Processing Systems
  8. 8. Data Science Architecture
  9. 9. Data Processing Technologies Landscape
  10. 10. Batch Processing Store and process Slow (> 5 minutes for results for a reasonable use case) Programming model is MapReduce - Apache Hadoop - Spark Many tools are built on top - Hive and Shark (SQL-style queries), Mahout (ML), Giraph (graph processing)
  11. 11. Use case: Big Data for Development Done using CDR (call detail record) data People density noon vs. midnight (red => increased, blue => decreased) Urban Planning - People distribution - Mobility - Waste Management - E.g. see From:
  12. 12. Value of Some Insights Degrades Fast! For some use cases (e.g. stock markets, traffic, surveillance, patient monitoring), the value of insights degrades very quickly with time. - E.g. stock markets and speed of light We need technology that can produce outputs fast - Static queries, but need very fast output (alerts, realtime control) - Dynamic and interactive queries (data exploration)
  13. 13. Complex Event Processing
  14. 14. Predictive Analytics - If we know how to solve a problem, that is, if we know a finite set of rules, then we can program it. - For some problems (e.g. driving a car, character recognition), we do not know a finite, fixed rule set. - Instead of programming, we give lots of examples and ask the computer to learn (often called Machine Learning) - Lots of tools - R (statistical language) - scikit-learn (Python) - Apache Spark’s MLBase and Apache Mahout (Java)
  15. 15. Use case: Predictive Maintenance The idea is to fix the problem before the equipment breaks, avoiding expensive downtime - Airplanes, turbines, windmills - Construction equipment - Cars, golf carts How - Build a model of normal operation and measure deviation from it - Match against known error patterns
  16. 16. Communicate: Dashboards - The idea is to give the “overall idea” at a glance (e.g. car dashboard) - Support for personalization: you can build your own dashboard - Also the entry point for drill-down - How to build? - Expose data via JSON - Build the dashboard via Google Gadget and the content via HTML5 + JavaScript (use charting libraries like Vega or D3)
  17. 17. Communicate: Alerts and Triggers Detecting conditions can be done via an event processing system (e.g. CEP) The key is the “Last Mile” - Email - SMS - Push notifications to a UI - Pager - Trigger a physical alarm
  18. 18. Case Study: Realtime Soccer Analysis Watch at:
  19. 19. Changing Dynamics
  20. 20. Large Observational Datasets Stats are easy with designed experiments - You get to select a representative set - You have a control group You have lots and lots of data and lots and lots of computing power (compared to what you had) Two reactions!!
  21. 21. “It is better to be roughly right than precisely wrong.” (John Keynes) In the long run, we are all dead!!
  22. 22. Challenges: Causality - Correlation does not imply causality!! (send a book home example [1]) - Causality - Repeat the experiment under identical conditions - If you can’t, do a randomized test (A/B test) - With big data we cannot do either - Option 1: We can act on correlation if we can verify the guess or if correctness is not critical (start an investigation, check for a disease, marketing) - Option 2: We verify correlations using A/B testing or propensity analysis [1] [2]
  23. 23. Curious Case of Missing Data, Pic from • WW II: returned aircraft and data on where they were hit • How would you add armour?
  24. 24. More Data Beats a Clever Algorithm Observed by large internet companies Also seen in Kaggle competitions E.g. SVM vs. logistic regression Read “A Few Useful Things to Know about Machine Learning” (Pedro Domingos)
  25. 25. Challenges: Feature Engineering In ML, feature engineering is the key [1]. You need features to form a kernel. Then you can solve the problem with less data. Deep learning can learn the best features (or feature combinations) via semi-supervised or unsupervised learning [2] 1. Bekkerman’s talk 2. Deep Learning,
  26. 26. Challenges: Taking Decisions (Context)
  27. 27. Challenges: Updating Models ● Incorporate more data o We get more data over time o We get feedback about the effectiveness of decisions (e.g. accuracy of fraud detection) o Trends change ● Track and update the model o Generate models in batch mode and update o Streaming (online) ML, which is an active research topic
  28. 28. Challenges: Lack of Labeled Data • Most data is not labeled • Idea of semi-supervised learning • Provide data + examples + ontology, and the algorithm finds new patterns – Lots of data – Few example sentences • Often uses the Expectation Maximization (EM) algorithm Watch Tom Mitchell’s lecture Ontology: People, Cities Relationships: like, dislike, live in Examples: Bob (People) lives in Colombo (City)
  29. 29. Two Takeaways Do your data processing as part of a bigger system - Think systems, automate, make a difference - Realtime vs. batch - Use tools (do not reinvent the wheel) Think about how dynamics are changing (uncontrolled experiments, lots of data) - Do not be a data pessimist - However, do not do stupid things either
  30. 30. Questions?
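
Slide 10 names MapReduce as the programming model for batch processing. A minimal sketch of that model follows, assuming plain in-memory Python rather than Hadoop or Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative and not taken from the talk or from any framework API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory dataset."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big systems"]))
```

A real cluster runs the map and reduce steps on many machines in parallel; this sketch only shows how a job is split into the two user-written functions.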
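
Slide 15 describes predictive maintenance as building a model of normal operation and comparing deviation from it. One simple way to read that, sketched below, is to summarize healthy sensor readings by their mean and standard deviation and flag readings that fall too many standard deviations away. The threshold of 3.0, the function names, and the sample readings are assumptions made for illustration, not details from the talk.

```python
from statistics import mean, stdev


def fit_normal_model(readings):
    """Summarize readings from healthy operation as (mean, standard deviation)."""
    return mean(readings), stdev(readings)


def is_anomalous(value, model, threshold=3.0):
    """Return True if value deviates from the model by more than threshold sigmas."""
    mu, sigma = model
    if sigma == 0:
        # Degenerate model: every healthy reading was identical.
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical temperature readings from a machine running normally.
    model = fit_normal_model([10.0, 10.2, 9.8, 10.1, 9.9])
    for reading in (10.05, 15.0):
        print(reading, is_anomalous(reading, model))
```

Production systems typically use richer models (multiple sensors, seasonality, learned failure signatures), but the compare-against-normal structure stays the same.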
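
Slide 22 suggests verifying a correlation with A/B testing. The slides do not say which statistical test to use; the sketch below assumes a two-proportion z-test, one common choice when the outcome is a yes/no conversion. The function name and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare conversion rates of groups A and B.

    Returns (z, p_value) for a two-sided test of equal proportions.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All outcomes identical in both groups: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the difference between randomly assigned groups is unlikely to be chance, which is the evidence that a correlation observed in uncontrolled data cannot provide on its own.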
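
Slide 27 contrasts regenerating models in batch mode with streaming (online) updates. As a minimal illustration of the online style, the sketch below keeps a running mean and variance that are updated one observation at a time using Welford's algorithm, without storing past data. This is a stand-in chosen for brevity, not an ML method named in the talk; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Mean and sample variance updated incrementally (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # Sum of squared deviations from the current mean.

    def update(self, value):
        """Fold one new observation into the statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two observations are seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
        stats.update(reading)
    print(stats.mean, stats.variance)
```

The same pattern, a small state object plus an `update` step per event, is what lets a model track new data and changing trends without a full batch recomputation.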