INDUSTRIAL DATA SCIENCE
Tuesday 20 August 2013
WHY DO WE NEED BIG DATA?
“SIMPLE MODELS AND A LOT OF DATA
TRUMP MORE ELABORATE MODELS
BASED ON LESS DATA”
Peter Norvig
BIG DATA CHANGES PEOPLE AND TECHNOLOGY
• Data changes the management mindset to expect having supporting data
available for all decisions
• Decision making then creates its own data stream that can be analyzed
• Data is an asset: What is its return? Net value? Depreciation? Future
investment plan?
“IN GOD WE TRUST,
ALL OTHERS BRING DATA”
W.E. Deming
DEPLOYING DATA
DATA STORAGE FOR LEARNING
• Efficient storage is critical for modeling feasibility
• What counts as efficient storage depends on the data, algorithms and environment
• Memory: working sets, small data, online learning, fast iterations needed
• Disk: M-estimation, local context sufficient
• Data warehouse: simple models in enterprises, complex input generation
• Distributed: stochastic/ensemble methods, large and complex production models
• Cloud: variable workloads, very massive data
UNSUPERVISED LEARNING IN USE
• Modeling
• Significance testing
• Decision making
• As input into other modeling
• Know-how
• Selection of useful pattern types
DEPLOYMENT OF A COMMON MODEL
• Components: modeling tool, database, service
• Flows: datasets periodically for learning, predictions written to DB, prediction request and answer
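A minimal sketch of the common-model deployment, assuming an in-memory SQLite database and a trivial "mean label per segment" model as a stand-in for a real modeling tool. The table names (`history`, `predictions`) and the functions `train`, `batch_job` and `serve` are hypothetical, chosen only for illustration.

```python
import sqlite3

def train(rows):
    """Fit a trivial model: the mean label per segment (stand-in for a modeling tool)."""
    sums = {}
    for segment, label in rows:
        total, count = sums.get(segment, (0.0, 0))
        sums[segment] = (total + label, count + 1)
    return {segment: total / count for segment, (total, count) in sums.items()}

def batch_job(db):
    """Periodic job: pull a dataset for learning, then write predictions to the DB."""
    rows = db.execute("SELECT segment, label FROM history").fetchall()
    model = train(rows)
    db.execute("DELETE FROM predictions")
    db.executemany(
        "INSERT INTO predictions (segment, score) VALUES (?, ?)",
        list(model.items()),
    )
    db.commit()

def serve(db, segment, default=0.0):
    """Service: answer a prediction request with a plain database lookup."""
    row = db.execute(
        "SELECT score FROM predictions WHERE segment = ?", (segment,)
    ).fetchone()
    return row[0] if row else default

# Hypothetical schema and data, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE history (segment TEXT, label REAL)")
db.execute("CREATE TABLE predictions (segment TEXT PRIMARY KEY, score REAL)")
db.executemany(
    "INSERT INTO history VALUES (?, ?)", [("a", 1.0), ("a", 0.0), ("b", 1.0)]
)
batch_job(db)
print(serve(db, "a"), serve(db, "unknown"))
```

The service never touches the modeling tool: it only reads precomputed scores, which is what keeps this setup simple to operate.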
DEPLOYMENT OF A LOCALIZED MODEL
• Components: modeling tool, database, service, data builder
• Flows: datasets periodically for learning, predictions, prediction request and answer, input construction, query input
DEPLOYMENT OF ONLINE LEARNING
• Components: modeling tool, database, service, incoming data stream
• Flows: data and/or labels, requests with data, predictions
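A minimal sketch of the online-learning loop, assuming a logistic regression updated by one stochastic gradient step per incoming labeled event. The class `OnlineLogistic`, its learning rate and the toy stream are assumptions for illustration, not part of the original deck.

```python
import math

class OnlineLogistic:
    """Logistic regression that learns from one labeled event at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        """Answer a request with data: probability of the positive class."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, label):
        """Consume data and a label from the stream: one gradient step on log loss."""
        error = self.predict(x) - label
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error

# Toy stream (hypothetical): the label is 1 exactly when the single feature is positive.
model = OnlineLogistic(n_features=1)
for i in range(200):
    x = [1.0] if i % 2 else [-1.0]
    model.update(x, 1 if x[0] > 0 else 0)

print(model.predict([1.0]), model.predict([-1.0]))
```

Predictions and updates share the same object, so the service can answer requests while the model keeps adapting to the stream.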
EVALUATING RESULTS AND QUALITY
• Properly evaluating the quality of modeling results depends on project
objectives, error costs and data specifics
• Classification error is misleading for skewed class sizes;
rank-based measures and ROC curves are not
• Operational improvements evaluated as lift and incremental $$$ over the previous solution
• Uneven error costs:
• Earthquake risk estimation
• Medical research: molecule potential vs. patient safety
• Upsetting recommendations to an e-commerce customer
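To illustrate the point about skewed classes, here is a small sketch assuming 2 positives among 100 cases. A model that always predicts the majority class gets a high accuracy, while its ROC AUC shows it has no ranking power. The helper names `accuracy` and `auc` are hypothetical.

```python
def accuracy(labels, predictions):
    """Share of cases where the predicted class equals the true class."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

def auc(labels, scores):
    """Area under the ROC curve: the probability that a random positive
    is scored above a random negative (ties count as half)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [1] * 2 + [0] * 98          # skewed classes: 2% positives

always_negative = [0] * 100          # trivial classifier
constant_scores = [0.0] * 100        # the same classifier seen as a ranking

print(accuracy(labels, always_negative))  # 0.98
print(auc(labels, constant_scores))       # 0.5
```

An accuracy of 0.98 looks excellent, yet an AUC of 0.5 is exactly what random ranking achieves.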
WHAT IS REAL-TIME?
• Real-time can mean very different things to different people
• Analyst: “What’s the user count today? By source? Now? From France?”
• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”
• Google: “Make a bid for these placements. You have 50 ms”
PROCESSING LARGE DATA
EXAMPLES OF DATA SIZE
Human-generated
• 5K tweets/s
• 25K events/s from a mobile game (that’s 200 GB / day)
• 40K Google searches/s
Machine-generated
• 5M quotes/s in the US options market
• 120 MB/s of diagnostics from a single gas turbine
• 1 PB/s peaking from CERN LHC
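The rates above translate into daily volumes with simple arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and sustained rates over a full day; both are assumptions, since the slides do not state them.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Mobile game: 25K events/s at 200 GB/day implies the average event size.
events_per_day = 25_000 * SECONDS_PER_DAY        # 2.16 billion events
bytes_per_event = 200e9 / events_per_day         # roughly 93 bytes per event

# Gas turbine: 120 MB/s of diagnostics, if sustained for a full day.
turbine_tb_per_day = 120e6 * SECONDS_PER_DAY / 1e12   # about 10.4 TB

print(events_per_day, round(bytes_per_event, 1), round(turbine_tb_per_day, 3))
```

A single turbine at that rate would produce about fifty times the daily volume of the mobile game.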
HUMAN AND MACHINE GENERATED DATA
• Human-generated data will get more detailed
• … but won’t grow much faster than the underlying userbase
• It will become small eventually
• Machine-generated data will grow according to Moore’s law
• … and it’s already massive
PROCESSING DATA THE OLD WAY
• User actions modify the current state in a transaction DB
• Single events go to an offline audit log for re-running
• Snapshots of data are exported for modeling
• Production models take exports of snapshots and
write back snapshot-versioned results
(Diagram: events, with periodic snapshots)
PROCESSING DATA IN THE STATUS QUO
• Data from operational databases is constantly copied over to a data
warehouse or an analytic database
• Ideally, this is a one-stop shop for all analytics and data science
• Production models preferably work inside the database,
providing high performance and data integrity
• Model learning can try pushing some operations down to the database, but
complex models will need an external tool
• Expensive modeling may require a separate testing database
PROCESSING DATA IN THE CLOUD
• The cloud allows practically unlimited scale
• No fixed limits on CPU and data usage, but everything is I/O-bound
• Enterprise hybrid clouds allow testing environments and “cloud bursting”
• Large datasets may require specialized algorithms or retrofits to MapReduce
• Combining stochastic learning, online learning and ensemble methods has
proven itself for the task
PRACTICAL ISSUES
REAL WORLD DATA IS RIDDLED WITH PROBLEMS
• Corrupted incoming data
• Corrupted IDs
• Transient IDs
• Multiple transient IDs without match
• Crazy timestamps
• Data types mixed up
• New variables emerge
• Old variables disappear
• Changes in variable definitions
• And much, much more …
(Diagram: Garbage → You → Great insights)
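Several of the problems listed above can be caught with cheap per-record checks before data reaches a model. The sketch below is an assumption-laden illustration: the field names `user_id` and `timestamp`, the lower bound `EARLIEST` and the function `check_event` are all hypothetical.

```python
from datetime import datetime, timezone

EARLIEST = datetime(2000, 1, 1, tzinfo=timezone.utc)  # assumed lower bound

def check_event(event, now):
    """Return a list of problems found in one incoming event (empty if clean)."""
    problems = []

    user_id = event.get("user_id")
    if not isinstance(user_id, str) or not user_id:
        problems.append("missing or corrupted ID")

    ts = event.get("timestamp")
    if isinstance(ts, bool) or not isinstance(ts, (int, float)):
        problems.append("timestamp has wrong type")
    else:
        try:
            when = datetime.fromtimestamp(ts, tz=timezone.utc)
        except (OverflowError, OSError, ValueError):
            problems.append("timestamp out of range")
        else:
            if not (EARLIEST <= when <= now):
                problems.append("crazy timestamp")
    return problems

now = datetime(2013, 8, 20, tzinfo=timezone.utc)
print(check_event({"user_id": "u1", "timestamp": now.timestamp() - 3600}, now))
print(check_event({"user_id": "", "timestamp": "2013-08-20"}, now))
```

Checks like these do not fix the data, but they make discontinuities visible instead of letting them pass silently into results.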
AND WITH MORE PROBLEMS
• Collected data is enriched with many operationally attainable sources
⇒ varying schemas and complicated ID soup
• Analytic data is often developed by the frontline instead of an IT waterfall process
⇒ faster process, but volatile data definitions
• Data scientists asking for more data ⇒ temporary kludges
• Data is big and growing ⇒ risks of unnoticed discontinuity
NO, I’M NOT FINISHED YET
• The data is not a CSV file sitting on your disk
• It’s coming in every second of the year, often gigabytes per hour
• Availability of this data is a business-critical issue
• Availability of modeling results is a business-critical issue
• Robustness of modeling results is a business-critical issue
DATA DRIFT
• Real-world data is rarely stationary
• Equipment ages, people’s preferences change
• The quality of old data models decays
• Training and testing data may need to be specially designed
• Prefer recent data with weights or online learning
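One way to prefer recent data is exponential decay by age. This is a sketch of that single option, assuming a 30-day half-life; the functions `recency_weights` and `weighted_mean` and the toy numbers are hypothetical.

```python
def recency_weights(ages_in_days, half_life_days=30.0):
    """Weight halves every `half_life_days`; today's data gets weight 1."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]

def weighted_mean(values, weights):
    """Mean of `values` where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

ages = [0, 30, 60]            # days since each observation
values = [10.0, 20.0, 40.0]   # the measured quantity has drifted downwards

weights = recency_weights(ages)            # [1.0, 0.5, 0.25]
estimate = weighted_mean(values, weights)  # 30 / 1.75

print(weights, estimate)
```

The weighted estimate sits closer to the latest value than the plain mean does, which is the intended effect under drift.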
ROBUST RESULTS?
• Inputs to a decision making process must be assessed for significance
“Can I trust these numbers? Is my decision justified?”
• Ad-hoc analyses can freely employ complex and bleeding edge modeling
• In operations, stability and robustness override everything else
• Sanity checks and fallbacks can be used to avoid failures and errors
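A sanity check with a fallback can be as small as a wrapper around the model call. The sketch below assumes the model output should be a number within a known range; `robust_predict` and its parameters are hypothetical names.

```python
import math

def robust_predict(model, features, fallback, low=0.0, high=1.0):
    """Call the model, but return `fallback` if it fails or gives an implausible value."""
    try:
        value = model(features)
    except Exception:
        return fallback  # the model crashed: keep operations running
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return fallback  # wrong type
    if math.isnan(value) or not (low <= value <= high):
        return fallback  # outside the sane range
    return value

print(robust_predict(lambda f: 0.3, {}, fallback=0.1))           # 0.3
print(robust_predict(lambda f: float("nan"), {}, fallback=0.1))  # 0.1
print(robust_predict(lambda f: 1 / 0, {}, fallback=0.1))         # 0.1
```

In practice the fallback is often a simpler model or the previous known-good result; counting how often it fires is a useful health metric.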
POWER LAWS
(Plot: number of users vs. revenue per user)
POWER LAWS
• Power laws are ubiquitous in the real world
• Follows from the principle: “Whoever has will be given more”
• Example: new links emerge to web pages in proportion to their popularity
• Product improvements can be tracked through changes in the power law curve
• Power laws often have a cut-off in the beginning:
not enough mass to fill the lowest ranks
• Examples
• User engagement and value
• Social network activity
• Brain activity
• Wealth distribution
CONSEQUENCES OF POWER LAWS
• Power laws imply extremely skewed distributions
⇒ but most models assume a Gaussian or otherwise more balanced distribution
• The huge mass at the bottom of the ladder breaks most traditional analyses
• Different parts of the curve have complex real world interaction
• On the other hand it is relatively easy to segment power laws
⇒ separately designed treatment for different target groups
• Bringing new users as part of the power law lifts the whole curve as new
entries slowly diffuse along the curve
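The skew and the ease of segmentation can be seen on a toy example. The sketch assumes an idealized Zipf curve where the user at rank r brings revenue proportional to 1/r; the exponent, the user count and the function names are illustrative assumptions, not data from the deck.

```python
def zipf_revenue(n_users, exponent=1.0):
    """Revenue per user for ranks 1..n under an idealized power law."""
    return [1.0 / rank ** exponent for rank in range(1, n_users + 1)]

def top_share(revenue, fraction):
    """Share of total revenue coming from the top `fraction` of users."""
    ordered = sorted(revenue, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)

revenue = zipf_revenue(10_000)

share_top_1pct = top_share(revenue, 0.01)          # about 0.53
mean = sum(revenue) / len(revenue)
median = sorted(revenue)[len(revenue) // 2]

print(round(share_top_1pct, 2), mean > median)
```

Under these assumptions the top 1% of users carries about half of the revenue and the mean is several times the median, so averages describe almost nobody. Segmenting by rank (say top 1%, next 9%, the rest) gives groups that can be treated separately.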
THE IMPORTANCE OF PRESENTATION
• Operations or not, visualization is critical for acceptance
• Challenger shuttle disaster linked to poor visualization of O-ring failure risks
• Requires attention from business concept to implementation
• What information do these users want to see?
• How does this information support decision making?
• How to visualize it with clarity yet powerfully?
DATA SCIENCE IN BUSINESS
• Data analysis in business is not the sole task of the data scientist
• The whole organization must gradually mature and engage with data
• This is not a technical barrier, it is a human barrier
• How to design business and social processes to employ data?
• The average business has tons of low-hanging data fruit
• Developing and automating all that takes years (and years)
• No use for advanced modeling without visibility to the underlying
WHAT’S COMING UP
PROCESSING DATA IN THE FUTURE
• The event stream itself is increasingly becoming the master input data for
analytics and data solutions
• This is a big sea change, requiring new designs of storage and processing
• Seeing the full timeline and interactions of each object is a mixed blessing
PROS: Huge opportunity for discovering significant value
CONS: A very complex haystack that needs additional processing; how can a human
focus on the essential?
STREAM PROCESSING
• Instead of handling static states of the data, the data is processed
as it enters the system
• The tables turn: the internal state of the stream, persisted to a database,
now becomes the backup in case of failure
• Obvious fit for quickly reactive online learning solutions
• The whole domain was spearheaded by computer trading
• Another example: credit card transaction processing and fraud prevention
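The idea of processing events on arrival while the database serves only as a backup can be sketched as follows. This assumes a per-user event counter and uses a plain dictionary as a stand-in for a persistent store; `StreamCounter` and its checkpoint interval are hypothetical.

```python
class StreamCounter:
    """Keeps per-user counts in memory and checkpoints them periodically."""

    def __init__(self, checkpoint_every=3):
        self.counts = {}
        self.processed = 0
        self.checkpoint_every = checkpoint_every
        self.database = None  # stand-in for a real persistent store

    def process(self, event):
        """Update the in-memory state as soon as the event enters the system."""
        user = event["user"]
        self.counts[user] = self.counts.get(user, 0) + 1
        self.processed += 1
        if self.processed % self.checkpoint_every == 0:
            # Persist a copy of the state; it is only read after a failure.
            self.database = {"counts": dict(self.counts), "processed": self.processed}

    def recover(self):
        """After a failure, restore the last checkpoint (or start empty)."""
        saved = self.database or {"counts": {}, "processed": 0}
        self.counts = dict(saved["counts"])
        self.processed = saved["processed"]

stream = StreamCounter()
for user in ["a", "b", "a", "c"]:
    stream.process({"user": user})
print(stream.counts)      # live state includes all four events

stream.recover()          # simulate a crash and restart
print(stream.counts)      # state as of the last checkpoint (three events)
```

Events after the last checkpoint must be replayed from the stream itself, which is why the event stream, not the database, acts as the master data.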
HADOOP AND DATA SCIENCE
• Hadoop is a general service platform, not just a MapReduce engine
• HBase is already becoming a hugely popular service backend
• In the long run Hadoop will also host a successful analytic database
• A wide selection of very different approaches to analytics and data science
exists already:
Hive and Pig, Impala, Mahout, Vowpal Wabbit, DataFu, Cloudera ML,
Giraph, RHadoop, …
REARRANGING THE MAP
• Change is not driven by replacing current bad solutions, but by innovating
around their shortcomings
• Stream processing of data will capture a large corner, driven by a sweeping
push closer to real-time
• High-level functional interfaces to data are another winner
• Examples: Cascading for batch processing, Trident for stream processing
• Further innovation in fixing MapReduce shortcomings
• Examples: Spark and Shark for iterative tasks, Impala for analytics
THE END

 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Dublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptxDublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptx
 

Industrial Data Science

  • 2. WHY DO WE NEED BIG DATA?
  • 3. “SIMPLE MODELS AND A LOT OF DATA TRUMP MORE ELABORATE MODELS BASED ON LESS DATA” Peter Norvig
  • 4. BIG DATA CHANGES PEOPLE AND TECHNOLOGY • Data changes the management mindset to expect having supporting data available for all decisions • Decision making then creates its own data stream that can be analyzed • Data is an asset: What is its return? Net value? Depreciation? Future investment plan?
  • 5. “IN GOD WE TRUST, ALL OTHERS BRING DATA” W.E. Deming
  • 7. DATA STORAGE FOR LEARNING • Efficient storage is critical for modeling feasibility • What is efficient storage depends on data, algorithms and environment • Memory: working sets, small data, online learning, fast iterations needed • Disk: M-estimation, local context sufficient • Data warehouse: simple models in enterprises, complex input generation • Distributed: stochastic/ensemble methods, large and complex production models • Cloud: variable workloads, very massive data
  • 8. UNSUPERVISED LEARNING IN USE [Diagram labels: Modeling; Significance testing; Decision making; As input into other modeling; Know-how; Selection of useful pattern types]
  • 9. DEPLOYMENT OF A COMMON MODEL [Diagram components: Modeling tool; Database; Service. Flows: Prediction request and answer; Datasets periodically for learning; Predictions written to DB]
  • 10. DEPLOYMENT OF A LOCALIZED MODEL [Diagram components: Modeling tool; Database; Service; Data builder. Flows: Prediction request and answer; Datasets periodically for learning; Predictions; Input construction; Query input]
  • 11. DEPLOYMENT OF ONLINE LEARNING [Diagram components: Modeling tool; Database; Service. Flows: Incoming data stream; Data and/or labels (two flows); Requests with data; Predictions]
  • 12. EVALUATING RESULTS AND QUALITY • Properly evaluating the quality of modeling results depends on project objectives, error costs and data specifics • Classification error makes no sense for skewed class sizes; ranks and ROC curves do • Operational improvements are evaluated as lift and incremental $$$ over previous • Uneven error costs: • Earthquake risk estimation • Medical research: molecule potential vs. patient safety • Upsetting recommendations to an e-commerce customer
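The point about skewed class sizes can be illustrated with a minimal sketch. The data below is hypothetical (990 negatives, 10 positives), and the two helper functions are illustrative rather than taken from the slides. A model that always predicts the majority class scores 99% accuracy, while its ROC AUC, computed here as the probability that a random positive is ranked above a random negative, shows that it has no ranking power at all.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    return sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in the
    data size, which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical skewed data set: 1% positives.
labels = [0] * 990 + [1] * 10

# A "model" that ignores its input and always says "negative".
always_negative = [0] * 1000
constant_scores = [0.0] * 1000

acc = accuracy(labels, always_negative)   # 0.99
auc = roc_auc(labels, constant_scores)    # 0.5, i.e. no better than chance
```

For production use, a library implementation such as scikit-learn's `roc_auc_score` would normally replace the hand-written helper.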
  • 13. WHAT IS REAL-TIME? • Real-time can mean very different things to different people • Analyst: “What’s the user count today? By source? Now? From France?” • Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?” • Google: “Make a bid for these placements. You have 50 ms”
  • 15. EXAMPLES OF DATA SIZE Human-generated • 5K tweets/s • 25K events/s from a mobile game (that’s 200 GB / day) • 40K Google searches/s Machine-generated • 5M quotes/s in the US options market • 120 MB/s of diagnostics from a single gas turbine • 1 PB/s peaking from CERN LHC
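As a back-of-the-envelope check of the figures above, the following sketch converts the quoted rates into daily volumes. It assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and constant rates over a full day, neither of which the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Mobile game: 25K events/s, quoted as 200 GB/day.
events_per_second = 25_000
bytes_per_day = 200 * 10**9                      # assumes decimal gigabytes
events_per_day = events_per_second * SECONDS_PER_DAY
bytes_per_event = bytes_per_day / events_per_day  # roughly 93 bytes per event

# Gas turbine: 120 MB/s of diagnostics, assumed sustained for a full day.
turbine_bytes_per_day = 120 * 10**6 * SECONDS_PER_DAY  # about 10.4 TB/day
```

Under these assumptions a single turbine produces roughly fifty times the daily volume of the mobile game.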
  • 16. HUMAN AND MACHINE GENERATED DATA • Human-generated data will get more detailed • … but won’t grow much faster than the underlying userbase • It will become small eventually • Machine-generated data will grow according to Moore’s law • … and it’s already massive
  • 17. PROCESSING DATA THE OLD WAY • User actions modify the current state in a transaction DB • Single events go to an offline audit log for re-running • Snapshots of data are exported for modeling • Production models take exports of snapshots, write back snapshot-versioned results [Diagram: Events, followed by a series of Snapshots]
  • 18. PROCESSING DATA IN THE STATUS QUO • Data from operational databases is constantly copied over to a data warehouse or an analytic database • This is ideally a one-stop shop for all analytics and data science • Production models preferably work inside the database, providing high performance and data integrity • Model learning can try pushing some operations back to the database, but complex models will need an external tool • Expensive modeling may require a separate testing database
  • 19. PROCESSING DATA IN THE CLOUD • Cloud allows endless scale • No fixed limits on CPU and data usage, but everything is I/O-bound • Enterprise hybrid clouds allow testing environments and “cloud bursting” • Large datasets may require specialized algorithms or retrofits to MapReduce • Combining stochastic learning, online learning and ensemble methods has proven itself for the task
  • 21. REAL WORLD DATA IS RIDDLED WITH PROBLEMS • Corrupted incoming data • Corrupted IDs • Transient IDs • Multiple transient IDs without match • Crazy timestamps • Data types mixed up • New variables emerge • Old variables disappear • Changes in variable definitions • And much, much more … [Diagram labels: You; Garbage; Great insights]
  • 22. AND WITH MORE PROBLEMS • Collected data is enriched with many operationally attainable sources ⇒ varying schemas and complicated ID soup • Analytic data often developed by frontline instead of IT waterfall ⇒ faster process, but volatile data definition • Data scientists asking for more data ⇒ temporary kludges • Data is big and growing ⇒ risks of unnoticed discontinuity
  • 23. NO, I’M NOT FINISHED YET • The data is not a CSV file sitting on your disk • It’s coming in every second of the year, often gigabytes per hour • Availability of this data is a business-critical issue • Availability of modeling results is a business-critical issue • Robustness of modeling results is a business-critical issue
  • 24. DATA DRIFT • Real-world data is rarely stationary • Equipment ages, people’s preferences change • The quality of old data models decays • Training and testing data may need to be specially designed • Prefer recent data with weights or online learning
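One way to "prefer recent data with weights" is exponential decay by age. The sketch below is only an illustration: the half-life parameter, the function names and the sample values are assumptions, not something the slides specify.

```python
def recency_weights(ages_in_days, half_life_days):
    """Weight 1.0 for today's data, halving every `half_life_days`."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


def weighted_mean(values, weights):
    """Mean of `values` where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical drifting signal: the level moved from 10 to 20 over 90 days.
values = [10.0, 10.0, 20.0, 20.0]
ages = [90, 60, 30, 0]                 # days since each observation

weights = recency_weights(ages, half_life_days=30)  # [0.125, 0.25, 0.5, 1.0]
plain = sum(values) / len(values)                   # 15.0
recent = weighted_mean(values, weights)             # 18.0
```

The weighted estimate sits closer to the current level. The same weights could be passed as per-sample weights to a learner that supports them.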
  • 25. ROBUST RESULTS? • Inputs to a decision making process must be assessed for significance: “Can I trust these numbers? Is my decision justified?” • Ad-hoc analyses can freely employ complex and bleeding-edge modeling • In operations, stability and robustness override everything else • Sanity checks and fallbacks can be used to avoid failures and errors
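A sanity check with a fallback can be as simple as a wrapper around the model call. The sketch below is a hypothetical illustration: the function name, the plausibility bounds and the fallback value are all assumptions chosen for the example.

```python
import math


def predict_with_fallback(model, features, fallback, lower, upper):
    """Return (value, source), where source is "model" or "fallback".

    The fallback is used when the model raises, returns a non-number,
    returns NaN, or returns a value outside the plausible range.
    """
    try:
        value = model(features)
    except Exception:
        return fallback, "fallback"
    if not isinstance(value, (int, float)):
        return fallback, "fallback"
    if math.isnan(value) or not (lower <= value <= upper):
        return fallback, "fallback"
    return value, "model"


# Hypothetical model: doubles a single input feature.
def toy_model(features):
    return features["a"] * 2


ok = predict_with_fallback(toy_model, {"a": 1.5}, 0.0, 0, 10)      # model value
missing = predict_with_fallback(toy_model, {}, 0.0, 0, 10)         # KeyError
too_big = predict_with_fallback(toy_model, {"a": 100}, 0.0, 0, 10)  # out of range
```

Returning the source alongside the value lets operations monitor how often the fallback fires, which is itself a useful signal of drift or data problems.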
  • 26. POWER LAWS [Plot; axis labels: Number of users, Revenue per user]
  • 27. POWER LAWS • Power laws are ubiquitous in the real world • They follow from the principle: “Whoever has will be given more” • Example: new links emerge to web pages in proportion to their popularity • Product improvements can be tracked through changes in the power law curve • Power laws often have a cut-off in the beginning, not enough mass to fill the lowest ranks • Examples • User engagement and value • Social network activity • Brain activity • Wealth distribution
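The "whoever has will be given more" mechanism can be sketched as a small simulation. This is an illustrative toy in the spirit of preferential attachment, not the presenter's model: at each step a new page appears with a small probability (an assumed parameter), otherwise a new link goes to an existing page chosen in proportion to the links it already has.

```python
import random


def simulate_link_growth(n_steps, new_page_prob=0.1, seed=0):
    """Return link counts per page, sorted from most to least linked."""
    rng = random.Random(seed)
    counts = [1]      # links per page; start with one page holding one link
    endpoints = [0]   # one entry per link, so choice() is popularity-weighted
    for _ in range(n_steps):
        if rng.random() < new_page_prob:
            # A new page enters with a single link.
            counts.append(1)
            endpoints.append(len(counts) - 1)
        else:
            # An existing page gains a link in proportion to its popularity.
            page = rng.choice(endpoints)
            counts[page] += 1
            endpoints.append(page)
    return sorted(counts, reverse=True)


ranked = simulate_link_growth(20_000)
top_share = ranked[0] / sum(ranked)          # share held by the top page
median_links = ranked[len(ranked) // 2]      # links of a typical page
```

Plotting `ranked` against rank on log-log axes is a quick way to eyeball the heavy tail: a handful of early pages collect most links while the typical page keeps very few.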
  • 28. CONSEQUENCES OF POWER LAWS • Power laws imply extremely skewed distributions ⇒ most models assume a Gaussian or generally more balanced distribution • Huge mass at the bottom of the ladder breaks most traditional analyses • Different parts of the curve have complex real world interaction • On the other hand it is relatively easy to segment power laws ⇒ separately designed treatment for different target groups • Bringing new users as part of the power law lifts the whole curve as new entries slowly diffuse along the curve
  • 29. THE IMPORTANCE OF PRESENTATION • Operations or not, visualization is critical for acceptance • Challenger shuttle disaster linked to poor visualization of O-ring failure risks • Requires attention from business concept to implementation • What information do these users want to see? • How does this information support decision making? • How to visualize it with clarity yet powerfully?
  • 30. DATA SCIENCE IN BUSINESS • Data analysis in business is not the sole task of the data scientist • The whole organization must gradually mature and engage with data • This is not a technical barrier, it is a human barrier • How to design business and social processes to employ data? • The average business has tons of low-hanging data fruit • Developing and automating all that takes years (and years) • No use for advanced modeling without visibility to the underlying
  • 32. PROCESSING DATA IN THE FUTURE • The event stream itself is increasingly becoming the master input data for analytics and data solutions • This is a big sea change, requiring new designs of storage and processing • Seeing the full timeline and interactions of each object is a mixed blessing • PROS: Huge opportunity for discovering significant value • CONS: A very complex haystack, needs additional processing, how can a human focus on the essential?
  • 33. STREAM PROCESSING • Instead of handling static states of the data, the data is processed as it enters the system • The tables turn: the internal state of the stream, persisted to a database, now becomes the backup for failures • Obvious fit for quickly reactive online learning solutions • The whole domain was spearheaded by computer trading • Another example: credit card transaction processing and fraud prevention
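The idea that the stream's internal state is primary and the database copy is only a backup can be sketched as follows. This is a minimal illustration under stated assumptions: a plain dict stands in for the database, the checkpoint interval is arbitrary, and the "model" is just a running mean and variance updated online with Welford's method.

```python
class RunningStats:
    """Online mean and variance (Welford's method), one event at a time."""

    def __init__(self, count=0, mean=0.0, m2=0.0):
        self.count, self.mean, self.m2 = count, mean, m2

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

    def snapshot(self):
        return {"count": self.count, "mean": self.mean, "m2": self.m2}


def process_stream(events, store, checkpoint_every=2):
    """Consume events, resuming from and periodically persisting to `store`."""
    stats = RunningStats(**store.get("stats", {}))  # restore after a failure
    for i, x in enumerate(events, start=1):
        stats.update(x)                             # state lives in memory
        if i % checkpoint_every == 0:
            store["stats"] = stats.snapshot()       # backup copy only
    return stats


store = {}  # stand-in for a database
live = process_stream([1.0, 2.0, 3.0, 4.0, 5.0], store)
# Simulated crash after the last checkpoint (4 events): replay the lost event.
resumed = process_stream([5.0], store)
```

After replaying the one event that arrived after the last checkpoint, the resumed state matches the live state. How events are replayed after a failure is a design decision this sketch leaves out.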
  • 34. HADOOP AND DATA SCIENCE • Hadoop is a general service platform, not just a MapReduce engine • HBase is already becoming a hugely popular service backend • In the long run Hadoop will also host a successful analytic database • A wide selection of very different approaches to analytics and data science exists already: Hive and Pig, Impala, Mahout, Vowpal Wabbit, DataFu, Cloudera ML, Giraph, RHadoop, …
  • 35. REARRANGING THE MAP • Change is not driven by replacing current bad solutions, but by innovating around their shortcomings • Stream processing of data will capture a large corner, driven by a sweeping push closer to real-time • High-level functional interfaces to data are another winner • Examples: Cascading for batch processing, Trident for stream processing • Further innovation in fixing MapReduce shortcomings • Examples: Spark and Shark for iterative tasks, Impala for analytics