
The evolution of data analytics

5,102 views

In the past decade a number of technologies have revolutionized the way we do analytics in banking. In this talk we would like to summarize this journey from classical statistical offline modeling to the latest real-time streaming predictive analytical techniques.

In particular, we will look at Hadoop and how this distributed computing paradigm has evolved with the advent of in-memory computing. We will introduce Spark, an engine for large-scale data processing optimized for in-memory computing.

Finally, we will describe how to make data science actionable and how to overcome some of the limitations of current batch processing with streaming analytics.

Published in: Data & Analytics

  1. The Evolution of Data Analytics
  2. about: how to grok data with machines and keep up with changing times
  3. The origins (40s, 50s, 60s) Operational Research during World War II. First Predictive Weather Model on ENIAC
  4. The origins (40s, 50s, 60s) ● Operational Research ● Collision loss vs Anti-Aircraft loss ● Optimization (Statistical) problems ● Scheduling and resource allocation
  5. The origins (40s, 50s, 60s) ● ENIAC predicting weather ● Barometric equations ● 24 hours compute time (mostly manual work)
  6. Analytics goes Mainstream (70s, 80s) ● The Relational Database is born! 1970: E.F. Codd, relational database model; normalization (free from insertion, deletion and update anomalies) 1976: Peter Chen, the entity-relationship model
  7. Analytics goes Mainstream (70s, 80s) ● 1982: IBM DB2, Oracle v3, Sybase (SAP) ● 1986: First standardized SQL ● 1987: Commercial use of Decision Support Systems: Texas Air Traffic Expert system
  8. http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/system360/impacts/
  9. Analytics goes Mainstream (70s, 80s) Exploratory Data Analysis: In 1977, Tukey published Exploratory Data Analysis, arguing that more emphasis needed to be placed on using data to suggest hypotheses to test and that Exploratory Data Analysis and Confirmatory Data Analysis “can—and should—proceed side by side.”
  10. The Internet goes Global (90s) ● 1995: Amazon ● 1995: eBay ● 1996: Hotmail ● 1998: Google ● 1998: PayPal
  11. Knowledge Discovery in Databases (1996)
  12. Knowledge Discovery in Databases (1996) What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. AI Magazine Volume 17 Number 3 (1996) (© AAAI) http://www.aaai.org/ojs/index.php/aimagazine/article/view/1230/1131
  13. The Internet goes Global (90s) ● Analytics (OLAP): Long queries, aggregations, data mining, reporting, models ● Operations (OLTP): Fast transactions, ACID, consistent, available, fault-tolerant
  14. Data warehouses and ETLs (90s) ● Building the Data Warehouse by William Inmon (John Wiley - QED, 1992)
  15. The World goes Social (00s) Web apps go into hyper-growth ● 2003: LinkedIn ● 2003: Skype ● 2004: Facebook ● 2006: Twitter
  16. The advent of MPP OLAPs (Early 00s) ● Massive multi-rack systems ● 100s of computing cores ● 100s of terabytes of storage ● Distributed computing ● Advanced Query Plans ● Columnar Data Models ● Re-programmable hardware
  17. The advent of MPP OLAPs (Early 00s) ● Vertica (HP) ● Greenplum (Pivotal) ● Netezza (IBM) ● Exadata (Oracle) ● Exasol (Exasol)
  18. Map-Reduce and Hadoop (Early 00s) ● Simpler programming paradigm ● Distributed, Replicated File System
  19. Map-Reduce and Hadoop (Early 00s)
  20. Hadoop or MPPs or both?
  21. Hadoop and MPPs (00s) ● MPP for speed and accuracy, well structured data ● Hadoop for size, flexibility, raw files
  22. The rise of the data scientist (late 00s) http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/ http://medriscoll.com/post/4740157098/the-three-sexy-skills-of-data-geeks
  23. Fast Data, APIs, Mobile and IoT (10s) ● WhatsApp: in a day ● 31 billion messages sent ● 700 million photos sent
  24. Fast Data, APIs, Mobile and IoT (10s) New Problems: ● Hadoop is too slow (File -> File) ● Productivity of Data Science goes down ● SQL is not enough ● Distributed Machine Learning algorithms?
  25. Streaming and Real-Time Analytics (10s)
  26. The RAM is the new Disk (10s) Spark is a new framework for in-memory computing. Unify in a Distributed Computing paradigm: SQL, Machine Learning, Map-Reduce, Graph Analytics
  27. Spark ● Generality: Combine SQL, streaming, and complex analytics. ● Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. ● Multiple Data Sources: It can access diverse data sources including HDFS, Cassandra, HBase, and S3. https://spark.apache.org/
  28. Popular Analytical Stacks (10s) ● Hadoop Hive + MPP ● Spark + Cassandra (no Hadoop!) ● Spark + HDFS + Elastic(Search)
  29. Future (10s, 20s) Micro-Batch and Event Streaming Analytics - Micro-Batch (Spark Streaming) - Log Oriented (Kafka, Samza) - NewSQL (VoltDB)
  30. Takeaways 1) SQL is here to stay 2) Data Science must be easy to program 3) Memory is King 4) Spark is the new Hadoop
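
The "simpler programming paradigm" on the Map-Reduce slide can be illustrated with a word count. What follows is a minimal single-process sketch in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration. In a real cluster the framework would run the map and reduce steps on many machines and do the grouping itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Hypothetical driver: map every line, group by word, reduce each group.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Illustrative input only; not data from the talk.
counts = word_count(["spark is the new hadoop", "memory is king"])
print(counts)
```

The programmer writes only the map and reduce functions; distribution, grouping and fault tolerance are left to the framework, which is the simplification the slide refers to.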

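The micro-batch idea named on the "Future" slide can be sketched in the same way: a stream of events is cut into small fixed-length batches, and each batch is processed like an ordinary batch job. This is a toy illustration in plain Python and does not use the Spark Streaming API; `micro_batches`, the one-second interval and the sample events are all hypothetical, and the sketch assumes events arrive ordered by timestamp.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, key) events into consecutive fixed-length batches.

    Assumes events are ordered by timestamp. Intervals with no events
    yield an empty batch, so downstream code sees every interval.
    """
    batch, batch_end = [], None
    for ts, key in events:
        if batch_end is None:
            batch_end = ts + batch_seconds  # first batch starts at first event
        while ts >= batch_end:
            yield batch  # close the current interval
            batch, batch_end = [], batch_end + batch_seconds
        batch.append(key)
    if batch:
        yield batch  # flush the final, possibly partial, batch


# Illustrative events only: (seconds since start, event type).
events = [(0.0, "msg"), (0.4, "photo"), (1.2, "msg"), (1.9, "msg"), (3.1, "photo")]
batch_counts = [Counter(b) for b in micro_batches(events, batch_seconds=1.0)]
print(batch_counts)
```

Each batch is aggregated independently, so latency is bounded by the batch interval rather than by the length of a nightly file-to-file job.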