SlideShare a Scribd company logo
1 of 31
Data Science in the Real
World: Making a
Difference
Srinath Perera
Director Research WSO2, Apache Member
(@srinath_perera)
srinath@wso2.com
StatDay 2015 @ University of Colombo
Outline
 Making sense of World’s Data
 Building Data Systems
 Changing Dynamics of Data Analysis
with Big Data ( Sensor Data)
 Challenges and Open Problems
Michael Stonebraker
“But then, out of nowhere, some
marketing guys started talking
about ‘big data, That’s when I
realized that I’d been studying
this thing for the better part of
my academic life.”
Michael Stonebraker
“But then, out of nowhere, some
marketing guys started talking
about ‘big data, That’s when I
realized that I’d been studying
this thing for the better part of
my academic life.”
ACM Turing Award,
A Day inYour Life
Think about a day in your life?
- What is the best road to take?
- Would there be any bad weather?
- How to invest my money?
- How is my health?
There are many decisions that you can do
better if only you can access the data and
process them.
http://www.flickr.com/photos/kcolwell/551246
1652/ CC licence
What can We do with Data?
Optimize (World is inefficient)
- 30% food wasted farm to plate
- GE Save 1% initiative (http://goo.gl/eYC0QE )
- Trains => 2B/ year
- US healthcare => 20B/ year
Save lives
- Weather, Disease identification, Personalized treatment
Technology advancement
- Most high tech research are done via simulations
Building Data
Processing Systems
Data Science Architecture
Data ProcessingTechnologies Landscape
Batch Processing
Store and process
Slow (> 5 minutes for results for
a reasonable usecase)
Programming model is
MapReduce
- Apache Hadoop
- Spark
Lot of tools built on top
- Hive Shark for (SQL style queries), Mahout (ML), Giraph (Graph Processing)
Usecase: Big Data for development
Done using CDR data
People density noon vs. midnight
(red => increased, blue =>
decreased)
Urban Planning
- People distribution
- Mobility
- Waste Management
- E.g. see http://goo.gl/jPujmM
From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
Value of some Insights degrade Fast!
For some usecases ( e.g. stock markets, traffic, surveillance, patient
monitoring) the value of insights degrades very quickly with time.
- E.g. stock markets and speed of light
We need technology that can produce
outputs fast
- Static Queries, but need very fast output
(Alerts, Realtime control)
- Dynamic and Interactive Queries ( Data
exploration)
Complex Event Processing
Predictive Analytics
 If we know how to solve a problem, that is if we know
a finite set of rules, then we can programs it.
 For some problems (e.g. Drive a car, character
recognition), we do not know a finite fix rule set.
 Instead of programming, we give lot of examples and
ask the computer to learn (often called Machine
Learning)
 Lot of tools
- R ( Statistical language)
- Sci-kit learn (Phython)
- Apache Spark’s MLBase and Apache Mahout (Java)
Usecase: Predictive Maintenance
Idea is to fix the problem before it
broke, avoiding expensive downtimes
- Airplanes, turbines, windmills
- Construction Equipment
- Car, Golf carts
How
- Build a model for normal operation and
compare deviation
- Match against known error patterns
Communicate:
Dashboards
 Idea is to given the “Overall idea” in a glance
(e.g. car dashboard)
 Support for personalization, you can build
your own dashboard.
 Also the entry point for Drill down
 How to build?
- Expose data via JSON
- Build Dashboard via Google Gadget and
content via HTML5 + java scripts (Use
charting libraries like Vega or D3)
Communicate:Alerts andTriggers
Detecting conditions can be done
via Event Processing system ( e.g.
CEP)
Key is the “Last Mile”
- Email
- SMS
- Push notifications to a UI
- Pager
- Trigger physical Alarm
Case Study: Realtime Soccer Analysis
Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
Changing Dynamics
Large Observational Datasets
Stats are easy with designed experiments
- You got to select a representative set
- You have a control group
You have lot and lot of data and lot and
lot of computing power ( compared to
what you had)
Two reactions!!
“It is better to be roughly
right than precisely
wrong.”
John Keynes―
In the long run, we
are all Dead!!
Challenges: Causality
 Correlation does not imply Causality!! ( send a book home
example [1])
 Causality
- do repeat experiment with identical test
- If CAN’T do a randomized test (A/B test)
- With Big data we cannot do either
 Option 1: We can act on correlation if we can verify the
guess or if correctness is not critical (Start Investigation,
Check for a disease, Marketing )
 Option 2: We verify correlations using A/B testing or
propensity analysis
[1] http://www.freakonomics.com/2008/12/10/the-blagojevich-upside/
[2] https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to/
Curious Case of Missing Data
http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from
http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
•WW II, Returned Aircrafts and
data on where they were hit?
•How would you add Armour?
More Data Beat a Clever Algorithm
Observed by large internet
companies
Also seen over keggle
Competitions
E.g. SVM vs. Logistic regression
Read “A Few Useful Things to Know
about Machine Learning” (Pedro
Domingos)
Challenges: Feature Engineering
In ML feature engineering is the key [1].
You need features to form a kernel. Then you can solve with
less data.
Deep learning can learn best feature (combination) via semi
or unsupervised learning [2]
1. Bekkerman’s talk https://www.youtube.com/watch?v=wjTJVhmu1JM
2. Deep Learning, http://cl.naist.jp/~kevinduh/a/deep2014/
Challenges:Taking Decisions (Context)
Challenges: Updating Models
● Incorporate more data
o We get more data over time
o We get feed back about effectiveness
of decisions (e.g. Accuracy of Fraud)
o Trends change
● Track and update model
o Generate models in batch mode and
update
o Streaming (Online) ML, which is an
active research topic
Challenges: Lack of Labeled Data
•Most data is not labeled
•Idea of Semi Supervised learning
•Provide Data + Examples +
Ontology, and algorithm find new
patterns
–Lot of Data
–Few example sentences
•Often uses Expectations
Maximization (EM) Algorithm
Watch Tom Mitchell’s Lecture https://www.youtube.com/watch?v=psFnHkIjHA0
Ontology: People, Cities
Relationships: like,
dislike, live in
Examples: Bob (People)
lives in Colombo (City)
TwoTakeaways
Do your data Processing as part of a Bigger system
- Think Systems, automate, make a difference
- Realtime vs Batch
- Use tools ( Do not reinvent the wheel)
Think how dynamics are changing (Uncontrolled experiments,
lot of Data)
- Do not be a data Pessimist
- However, do not do stupid things either
Questions?

More Related Content

What's hot

What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
Simplilearn
 

What's hot (20)

Introduction to data science club
Introduction to data science clubIntroduction to data science club
Introduction to data science club
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine learning ppt.
Machine learning ppt.Machine learning ppt.
Machine learning ppt.
 
Data visualization
Data visualizationData visualization
Data visualization
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data science
Data science Data science
Data science
 
Statistics and data science
Statistics and data scienceStatistics and data science
Statistics and data science
 
Machine learning basics
Machine learning basics Machine learning basics
Machine learning basics
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Data science - An Introduction
Data science - An IntroductionData science - An Introduction
Data science - An Introduction
 
Cross validation
Cross validationCross validation
Cross validation
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 
Introduction to Data Visualization
Introduction to Data VisualizationIntroduction to Data Visualization
Introduction to Data Visualization
 
Deep learning
Deep learningDeep learning
Deep learning
 
The Analytics and Data Science Landscape
The Analytics and Data Science LandscapeThe Analytics and Data Science Landscape
The Analytics and Data Science Landscape
 

Viewers also liked

Wso2datasciencesummerschool20151 150714180825-lva1-app6892
Wso2datasciencesummerschool20151 150714180825-lva1-app6892Wso2datasciencesummerschool20151 150714180825-lva1-app6892
Wso2datasciencesummerschool20151 150714180825-lva1-app6892
WSO2
 

Viewers also liked (20)

Data Science Applications
Data Science ApplicationsData Science Applications
Data Science Applications
 
Share Information, Change the World: Big Data, Small Apps, Smart Dashboards &...
Share Information, Change the World: Big Data, Small Apps, Smart Dashboards &...Share Information, Change the World: Big Data, Small Apps, Smart Dashboards &...
Share Information, Change the World: Big Data, Small Apps, Smart Dashboards &...
 
Data science fin_tech_2016
Data science fin_tech_2016Data science fin_tech_2016
Data science fin_tech_2016
 
Big Data + Social Graph
Big Data + Social GraphBig Data + Social Graph
Big Data + Social Graph
 
Smart City Ecosystem, fram data to value for the citizens, Km4City solution, ...
Smart City Ecosystem, fram data to value for the citizens, Km4City solution, ...Smart City Ecosystem, fram data to value for the citizens, Km4City solution, ...
Smart City Ecosystem, fram data to value for the citizens, Km4City solution, ...
 
myRide: A Real-Time Information System for the Carnegie Mellon University Shu...
myRide: A Real-Time Information System for the Carnegie Mellon University Shu...myRide: A Real-Time Information System for the Carnegie Mellon University Shu...
myRide: A Real-Time Information System for the Carnegie Mellon University Shu...
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data Science
 
Wso2datasciencesummerschool20151 150714180825-lva1-app6892
Wso2datasciencesummerschool20151 150714180825-lva1-app6892Wso2datasciencesummerschool20151 150714180825-lva1-app6892
Wso2datasciencesummerschool20151 150714180825-lva1-app6892
 
Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...
Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...
Denodo DataFest 2016: Data Science: Operationalizing Analytical Models in Rea...
 
Top Data Science Trends for 2015
Top Data Science Trends for 2015Top Data Science Trends for 2015
Top Data Science Trends for 2015
 
Mapping (big) data science (15 dec2014)대학(원)생
Mapping (big) data science (15 dec2014)대학(원)생Mapping (big) data science (15 dec2014)대학(원)생
Mapping (big) data science (15 dec2014)대학(원)생
 
Real time data services
Real time data servicesReal time data services
Real time data services
 
Data Science ATL Meetup - Risk I/O Security Data Science
Data Science ATL Meetup - Risk I/O Security Data ScienceData Science ATL Meetup - Risk I/O Security Data Science
Data Science ATL Meetup - Risk I/O Security Data Science
 
Banking & Smart City Ecosystem
Banking & Smart City EcosystemBanking & Smart City Ecosystem
Banking & Smart City Ecosystem
 
Real Time Big Data
Real Time Big DataReal Time Big Data
Real Time Big Data
 
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
 
Big Data Ecosystem
Big Data EcosystemBig Data Ecosystem
Big Data Ecosystem
 
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
Pivotal Digital Transformation Forum: Accelerate Time to Market with Business...
 
[2A7]Linkedin'sDataScienceWhyIsItScience
[2A7]Linkedin'sDataScienceWhyIsItScience[2A7]Linkedin'sDataScienceWhyIsItScience
[2A7]Linkedin'sDataScienceWhyIsItScience
 
Pivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
Pivotal Digital Transformation Forum: Becoming a Data Driven EnterprisePivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
Pivotal Digital Transformation Forum: Becoming a Data Driven Enterprise
 

Similar to Data Science in the Real World: Making a Difference

ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
Srinath Perera
 
Partner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_dataPartner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_data
Treasure Data, Inc.
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
Doug Needham
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
eswcsummerschool
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
butest
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
butest
 
ICPSR - Complex Systems Models in the Social Sciences - Lecture 6 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 6 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 6 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 6 - Professor...
Daniel Katz
 
Integrating and publishing public safety data using semantic technologies
Integrating and publishing public safety data using semantic technologiesIntegrating and publishing public safety data using semantic technologies
Integrating and publishing public safety data using semantic technologies
Alvaro Graves
 

Similar to Data Science in the Real World: Making a Difference (20)

ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Machine learning at b.e.s.t. summer university
Machine learning  at b.e.s.t. summer universityMachine learning  at b.e.s.t. summer university
Machine learning at b.e.s.t. summer university
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 
Industry of Things World - Berlin 19-09-16
Industry of Things World - Berlin 19-09-16Industry of Things World - Berlin 19-09-16
Industry of Things World - Berlin 19-09-16
 
On Big Data
On Big DataOn Big Data
On Big Data
 
[243] turning data into value
[243] turning data into value[243] turning data into value
[243] turning data into value
 
Machine Learning in the Real World
Machine Learning in the Real WorldMachine Learning in the Real World
Machine Learning in the Real World
 
Partner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_dataPartner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_data
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
ICPSR - Complex Systems Models in the Social Sciences - Lecture 6 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 6 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 6 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 6 - Professor...
 
Integrating and publishing public safety data using semantic technologies
Integrating and publishing public safety data using semantic technologiesIntegrating and publishing public safety data using semantic technologies
Integrating and publishing public safety data using semantic technologies
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 

More from Srinath Perera

More from Srinath Perera (20)

Book: Software Architecture and Decision-Making
Book: Software Architecture and Decision-MakingBook: Software Architecture and Decision-Making
Book: Software Architecture and Decision-Making
 
Data science Applications in the Enterprise
Data science Applications in the EnterpriseData science Applications in the Enterprise
Data science Applications in the Enterprise
 
An Introduction to APIs
An Introduction to APIs An Introduction to APIs
An Introduction to APIs
 
An Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance ProfessionalsAn Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance Professionals
 
AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?
 
Healthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & ChallengesHealthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & Challenges
 
How would AI shape Future Integrations?
How would AI shape Future Integrations?How would AI shape Future Integrations?
How would AI shape Future Integrations?
 
The Role of Blockchain in Future Integrations
The Role of Blockchain in Future IntegrationsThe Role of Blockchain in Future Integrations
The Role of Blockchain in Future Integrations
 
Future of Serverless
Future of ServerlessFuture of Serverless
Future of Serverless
 
Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going? Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going?
 
Few thoughts about Future of Blockchain
Few thoughts about Future of BlockchainFew thoughts about Future of Blockchain
Few thoughts about Future of Blockchain
 
A Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New TechnologiesA Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New Technologies
 
Privacy in Bigdata Era
Privacy in Bigdata  EraPrivacy in Bigdata  Era
Privacy in Bigdata Era
 
Blockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and RisksBlockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and Risks
 
Today's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology LandscapeToday's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology Landscape
 
An Emerging Technologies Timeline
An Emerging Technologies TimelineAn Emerging Technologies Timeline
An Emerging Technologies Timeline
 
The Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsThe Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming Applications
 
Analytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the UglyAnalytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the Ugly
 
Transforming a Business Through Analytics
Transforming a Business Through AnalyticsTransforming a Business Through Analytics
Transforming a Business Through Analytics
 
SoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration TechnologySoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration Technology
 

Recently uploaded

Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 

Recently uploaded (20)

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 

Data Science in the Real World: Making a Difference

  • 1. Data Science in the Real World: Making a Difference Srinath Perera Director Research WSO2, Apache Member (@srinath_perera) srinath@wso2.com StatDay 2015 @ University of Colombo
  • 2. Outline  Making sense of World’s Data  Building Data Systems  Changing Dynamics of Data Analysis with Big Data ( Sensor Data)  Challenges and Open Problems
  • 3. Michael Stonebraker “But then, out of nowhere, some marketing guys started talking about ‘big data, That’s when I realized that I’d been studying this thing for the better part of my academic life.”
  • 4. Michael Stonebraker “But then, out of nowhere, some marketing guys started talking about ‘big data, That’s when I realized that I’d been studying this thing for the better part of my academic life.” ACM Turing Award,
  • 5. A Day inYour Life Think about a day in your life? - What is the best road to take? - Would there be any bad weather? - How to invest my money? - How is my health? There are many decisions that you can do better if only you can access the data and process them. http://www.flickr.com/photos/kcolwell/551246 1652/ CC licence
  • 6.
  • 7. What can We do with Data? Optimize (World is inefficient) - 30% food wasted farm to plate - GE Save 1% initiative (http://goo.gl/eYC0QE ) - Trains => 2B/ year - US healthcare => 20B/ year Save lives - Weather, Disease identification, Personalized treatment Technology advancement - Most high tech research are done via simulations
  • 11. Batch Processing Store and process Slow (> 5 minutes for results for a reasonable usecase) Programming model is MapReduce - Apache Hadoop - Spark Lot of tools built on top - Hive Shark for (SQL style queries), Mahout (ML), Giraph (Graph Processing)
  • 12. Usecase: Big Data for development Done using CDR data People density noon vs. midnight (red => increased, blue => decreased) Urban Planning - People distribution - Mobility - Waste Management - E.g. see http://goo.gl/jPujmM From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
  • 13. Value of some Insights degrade Fast! For some usecases ( e.g. stock markets, traffic, surveillance, patient monitoring) the value of insights degrades very quickly with time. - E.g. stock markets and speed of light We need technology that can produce outputs fast - Static Queries, but need very fast output (Alerts, Realtime control) - Dynamic and Interactive Queries ( Data exploration)
  • 15. Predictive Analytics  If we know how to solve a problem, that is if we know a finite set of rules, then we can programs it.  For some problems (e.g. Drive a car, character recognition), we do not know a finite fix rule set.  Instead of programming, we give lot of examples and ask the computer to learn (often called Machine Learning)  Lot of tools - R ( Statistical language) - Sci-kit learn (Phython) - Apache Spark’s MLBase and Apache Mahout (Java)
  • 16. Usecase: Predictive Maintenance Idea is to fix the problem before it broke, avoiding expensive downtimes - Airplanes, turbines, windmills - Construction Equipment - Car, Golf carts How - Build a model for normal operation and compare deviation - Match against known error patterns
  • 17. Communicate: Dashboards  Idea is to given the “Overall idea” in a glance (e.g. car dashboard)  Support for personalization, you can build your own dashboard.  Also the entry point for Drill down  How to build? - Expose data via JSON - Build Dashboard via Google Gadget and content via HTML5 + java scripts (Use charting libraries like Vega or D3)
  • 18. Communicate:Alerts andTriggers Detecting conditions can be done via Event Processing system ( e.g. CEP) Key is the “Last Mile” - Email - SMS - Push notifications to a UI - Pager - Trigger physical Alarm
  • 19. Case Study: Realtime Soccer Analysis Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
  • 21. Large Observational Datasets Stats are easy with designed experiments - You got to select a representative set - You have a control group You have lot and lot of data and lot and lot of computing power ( compared to what you had) Two reactions!!
  • 22. “It is better to be roughly right than precisely wrong.” John Keynes― In the long run, we are all Dead!!
  • 23. Challenges: Causality  Correlation does not imply Causality!! ( send a book home example [1])  Causality - do repeat experiment with identical test - If CAN’T do a randomized test (A/B test) - With Big data we cannot do either  Option 1: We can act on correlation if we can verify the guess or if correctness is not critical (Start Investigation, Check for a disease, Marketing )  Option 2: We verify correlations using A/B testing or propensity analysis [1] http://www.freakonomics.com/2008/12/10/the-blagojevich-upside/ [2] https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to/
  • 24. Curious Case of Missing Data http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/ •WW II, Returned Aircrafts and data on where they were hit? •How would you add Armour?
  • 25. More Data Beat a Clever Algorithm Observed by large internet companies Also seen over keggle Competitions E.g. SVM vs. Logistic regression Read “A Few Useful Things to Know about Machine Learning” (Pedro Domingos)
  • 26. Challenges: Feature Engineering In ML feature engineering is the key [1]. You need features to form a kernel. Then you can solve with less data. Deep learning can learn best feature (combination) via semi or unsupervised learning [2] 1. Bekkerman’s talk https://www.youtube.com/watch?v=wjTJVhmu1JM 2. Deep Learning, http://cl.naist.jp/~kevinduh/a/deep2014/
  • 28. Challenges: Updating Models ● Incorporate more data o We get more data over time o We get feed back about effectiveness of decisions (e.g. Accuracy of Fraud) o Trends change ● Track and update model o Generate models in batch mode and update o Streaming (Online) ML, which is an active research topic
  • 29. Challenges: Lack of Labeled Data •Most data is not labeled •Idea of Semi Supervised learning •Provide Data + Examples + Ontology, and algorithm find new patterns –Lot of Data –Few example sentences •Often uses Expectations Maximization (EM) Algorithm Watch Tom Mitchell’s Lecture https://www.youtube.com/watch?v=psFnHkIjHA0 Ontology: People, Cities Relationships: like, dislike, live in Examples: Bob (People) lives in Colombo (City)
  • 30. TwoTakeaways Do your data Processing as part of a Bigger system - Think Systems, automate, make a difference - Realtime vs Batch - Use tools ( Do not reinvent the wheel) Think how dynamics are changing (Uncontrolled experiments, lot of Data) - Do not be a data Pessimist - However, do not do stupid things either