Big data in Texas has experienced significant growth and changes over the past few decades. In the 1990s, Austin was a hub for alternative culture and the internet was exploding, fueling innovation. Since then, data science has emerged as a field and machine learning has grown exponentially. Texas universities have become leaders in data-focused research and applications. Looking ahead, the state is well-positioned for continued data innovation as industry and academia increasingly collaborate around data science.
SOFTELECTRONICTURBO: a turbocharger service that repairs, reconditions, and supplies new turbochargers for trucks.
We are a well-known manufacturer, supplier, importer, and exporter of a wide range of precious stones, semi-precious stones, and more. We import our raw materials from reliable sources.
Information and communication technologies (ICT) are the tools and programs that process, manage, transmit, and share information through technological means. Computing, the Internet, and telecommunications are the most widespread ICTs, although their growth and evolution mean that new models keep emerging.
In recent years, ICTs have taken on an enormously important role in our society and are used in a multitude of activities. They are now part of most sectors: education, robotics, public administration, employment and business, health…
How To Speak To Them On Their Wavelength
George Hutton
http://mindpersuasion.com/ir/
If you speak to anybody on their wavelength, they will be much more likely to go along with your ideas. Luckily, learning how to do this is incredibly easy. Learn How: http://mindpersuasion.com/ir/
Human in the loop: a design pattern for managing teams working with ML
Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
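The exception-handling loop described above can be sketched in a few lines. This is a hedged toy: the nearest-centroid model, the margin threshold, and the simulated oracle are all illustrative choices, not any specific production pipeline.

```python
# Toy active-learning loop: a simple nearest-centroid model handles
# most items automatically; low-confidence cases get referred to a
# human oracle (simulated here), and those judgements are folded back
# into the training data, improving the next iteration of the model.

def centroid(points):
    return [sum(dim) / len(points) for dim in zip(*points)]

def predict(centroids, x):
    # returns (label, margin); the gap between the two nearest
    # centroids serves as a crude confidence score
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(c, x)), label)
        for label, c in centroids.items())
    (d0, label), (d1, _) = dists[0], dists[1]
    return label, d1 - d0

labeled = {"pos": [(1.0, 1.0)], "neg": [(-1.0, -1.0)]}
stream = [(0.9, 1.1), (0.05, -0.02), (-1.2, -0.8)]

def human_oracle(x):
    # stand-in for the subject matter expert
    return "pos" if x[0] + x[1] > 0 else "neg"

escalated = []
for x in stream:
    cents = {lbl: centroid(pts) for lbl, pts in labeled.items()}
    label, margin = predict(cents, x)
    if margin < 0.5:              # exception: the model isn't confident
        label = human_oracle(x)   # refer the judgement to a human expert
        labeled[label].append(x)  # the expert's decision retrains the model
        escalated.append(x)

print(escalated)  # only the ambiguous item was escalated
```

Only the item near the decision boundary reaches the human; the two clear-cut items are handled automatically, which is the efficiency argument for HITL.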
Human-in-the-loop: a design pattern for managing teams that leverage ML
Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
* In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-the-loop: a design pattern for managing teams which leverage ML
Paco Nathan
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning, such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
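Since `nbtransom` builds atop `nbformat`, it helps to remember that a notebook file is just structured JSON. A minimal sketch (hand-rolling the structure that `nbformat` normally manages, with hypothetical annotation cells) shows how a tool can treat a notebook as part configuration file, part structured log:

```python
import json

# A Jupyter notebook is plain JSON: a list of cells plus metadata.
# Here we hand-roll that structure to show how a tool can read and
# write machine-readable annotations as cells; the "annotation:"
# convention below is made up for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": "## annotation: label=spark"},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": "print('training example')"},
    ],
}

# Round-trip through JSON, then collect the annotation cells
text = json.dumps(notebook)
loaded = json.loads(text)
labels = [c["source"] for c in loaded["cells"]
          if c["cell_type"] == "markdown" and "annotation:" in c["source"]]
print(labels)
```

Because both people and programs can read and append cells, the same document works as the shared artifact that humans and machines collaborate on.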
Humans in the loop: AI in open source and industry
Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials, which combine video with Jupyter notebooks and Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
Strata UK 2017. Computable content leverages Jupyter notebooks to make learning materials more powerful by integrating compute engines, data sources, etc. O’Reilly Media extended this approach to create the new Oriole Online Tutorial medium, publishing notebooks from authors along with video timelines. (A free public tutorial, Regex Golf, by Peter Norvig demonstrates what’s possible with this technology integration.) Each user session launches a Docker container on a Mesos cluster for fully personalized compute environments. The UX is entirely browser based.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates `TextBlob` and `spaCy` for NLP analysis of texts, including full parses, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
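The core TextRank idea is compact enough to sketch. The toy below builds a co-occurrence graph and runs a PageRank-style iteration; it is an illustration only, since PyTextRank itself adds part-of-speech filtering, lemmatization, and spaCy's full parse rather than this naive tokenizer.

```python
# Minimal sketch of the TextRank algorithm (Mihalcea 2004): score words
# by PageRank over a co-occurrence graph, then rank them. Tokenization
# here is naive; PyTextRank instead uses spaCy's parse and POS filters.
import re
from collections import defaultdict

def textrank_keywords(text, window=2, damping=0.85, iters=50):
    words = re.findall(r"[a-z]+", text.lower())
    neighbors = defaultdict(set)          # undirected co-occurrence graph
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    rank = {w: 1.0 for w in neighbors}
    for _ in range(iters):                # power iteration of PageRank
        rank = {w: (1 - damping) + damping * sum(
                    rank[u] / len(neighbors[u]) for u in neighbors[w])
                for w in neighbors}
    return sorted(rank, key=rank.get, reverse=True)

top = textrank_keywords(
    "graph algorithms rank keyphrases; graph algorithms score texts "
    "by ranking words in a graph")
print(top[:3])
```

Words that co-occur with many other well-connected words accumulate rank, which is why repeated, central terms float to the top without any training data.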
Use of standards and related issues in predictive analytics
Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
Presented 2015-08-24 at SF Bay ACM, held at the eBay south campus in San Jose.
http://meetup.com/SF-Bay-ACM/events/221693508/
Project Jupyter https://jupyter.org/ evolved from IPython notebooks and now supports a wide variety of programming language back-ends. Notebooks have proven to be effective tools in Data Science, providing convenient packaging for what Don Knuth coined as "literate programming" in the 1980s: code plus exposition in markdown. Results of running the code appear in-line as interactive graphics -- all packaged as collaborative, web-based documents. Some have said that the introduction of cloud-based notebooks is nearly as fundamental a change in software practice as the introduction of spreadsheets.
O'Reilly Media has been considering the question, "What comes after books and video?" Or, as one might imagine more pointedly, what comes after Kindle? To that point we have collaborated with Project Jupyter to integrate notebooks into our content management process, allowing authors to generate articles, tutorials, reports, and other media products as notebooks that also incorporate video segments. Code dependencies are containerized using Docker, and all of the content gets managed in Git repositories. We have added another layer, an open source project called Thebe, which provides a kind of "media player" for embedding the containerized notebooks into web pages.
GalvanizeU Seattle: Eleven Almost-Truisms About Data
Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning
Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, on which we then run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
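As a hedged illustration of one of those questions ("who discusses most frequently with whom?"), reply pairs can be counted directly once the parser stage has produced structured messages. The message records below are made up for the example, not the actual Exsto schema:

```python
# Count sender/replier pairs from parsed email-thread data: a toy
# version of Exsto's "who discusses most frequently with whom?" query.
# The (msg_id, in_reply_to, sender) records are hypothetical examples.
from collections import Counter

messages = [
    ("m1", None, "alice"),
    ("m2", "m1", "bob"),
    ("m3", "m2", "alice"),
    ("m4", "m1", "carol"),
]

sender = {mid: who for mid, _, who in messages}
pairs = Counter()
for mid, parent, who in messages:
    if parent is not None:
        # unordered pair: a reply counts for both participants
        pairs[frozenset((who, sender[parent]))] += 1

print(pairs.most_common(1))  # alice and bob exchanged the most replies
```

At scale the same counting becomes a Spark aggregation, and the pair counts become edge weights in the community graph that GraphX analyzes.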
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
QCon São Paulo: Real-Time Analytics with Spark Streaming
Paco Nathan
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
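The memory-for-accuracy trade-off mentioned above shows up clearly in a sketch of one such probabilistic data structure, a Count-Min sketch. The depth and width here are arbitrary illustrative choices, not tuned production values:

```python
# Minimal Count-Min sketch: fixed-size counters that never
# under-estimate a frequency; hash collisions can only inflate it.
import hashlib

class CountMin:
    def __init__(self, depth=4, width=256):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # one independent hash per row
        for d in range(self.depth):
            h = hashlib.sha256(f"{d}:{item}".encode()).digest()
            yield d, int.from_bytes(h[:8], "big") % self.width

    def add(self, item, count=1):
        for d, col in self._cells(item):
            self.table[d][col] += count

    def estimate(self, item):
        # take the least-inflated row
        return min(self.table[d][col] for d, col in self._cells(item))

cms = CountMin()
for word in ["spark"] * 100 + ["storm"] * 3:
    cms.add(word)
print(cms.estimate("spark"))  # at least 100; usually exactly 100
```

The memory footprint is fixed (`depth * width` counters) no matter how many distinct items stream through, which is exactly the trade the talk describes: a small, bounded over-count in exchange for a drastically smaller footprint.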
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. Those gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
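As a hedged toy of that link-prediction setting (the graph and the scoring rule are illustrative, not taken from the talk): score a candidate edge by counting common neighbours, a deliberately simple and therefore predictable inference rule.

```python
# Toy link prediction over a tiny knowledge graph: score a candidate
# edge by common-neighbour overlap. Triples and entities are made up.
from collections import defaultdict

triples = [("ada", "knows", "grace"), ("ada", "knows", "alan"),
           ("grace", "knows", "alan"), ("grace", "knows", "kurt")]

neigh = defaultdict(set)
for s, _, o in triples:
    neigh[s].add(o)
    neigh[o].add(s)

def score(a, b):
    # common-neighbour count: one can state exactly which inferences
    # this rule licenses, i.e. its semantics is "predictable inference"
    return len(neigh[a] & neigh[b])

print(score("ada", "kurt"))  # prints 1: they share "grace"
```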
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that lead to closing the deal.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and provide a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
JMeter webinar - integration with InfluxDB and Grafana
RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
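As a hedged sketch of the data path the webinar describes, a load-test result can be rendered as InfluxDB "line protocol", the text format a JMeter backend listener writes and Grafana later queries. The measurement, tag, and field names below are illustrative, not JMeter's actual schema:

```python
# Render a metric sample as InfluxDB line protocol:
#   measurement,tag=value field=value timestamp
# Names below are hypothetical, for illustration only.
def to_line_protocol(measurement, tags, fields, ts_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol(
    "jmeter_sample",
    {"label": "login", "status": "ok"},
    {"latency_ms": 42, "threads": 8},
    1717000000000000000)
print(line)
```

Once samples arrive in this shape, Grafana dashboards are just queries over the measurement and tags, which is what makes the real-time view possible.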
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes hard work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
DevOps and Testing slides at DASA Connect
Kari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
1. “Big Data in Texas: Then, Now, and Ahead”
Paco Nathan, Evil Mad Scientist @ Concurrent, Inc.
2. Then, Now, and Ahead
THEN
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
3. observations…
Lynn asked me to talk about Data here today. A few weeks ago we stepped back for a moment to reflect about what we’d seen happen in Austin over the years.
Both of us ran alternative bookstores in Austin, twenty or so years ago, and we participated as the Internet thing exploded in the 1990s. That was a blast…
11. observations…
Overall, it’s about systems thinking
We have a wealth of that here, at UT/Austin in particular…
Ilya Prigogine spent years here, which is just incredible
School of Architecture, with leading work in VR, GIS, etc.
Interactive innovations at ACTLab…
Quantitative emphasis at McCombs…
major intellectual resources here
12. Then, Now, and Ahead
NOW
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
13. Data Science
[diagram: roles on a data team, set against a backdrop of mirrored machine-log text (game event logs such as NUI:DressUpMode, Client Inventory Panel Apply Product, ConnectivityTest, etc.): Domain Expert covers business process and stakeholder; Data Scientist covers data science, data prep, discovery, modeling, etc.; App Dev covers software engineering and automation; Ops covers systems engineering and availability; together they form introduced capability]
15. references…
Data Jujitsu, by DJ Patil; O’Reilly, 2012; amazon.com/dp/B008HMN5BE
Building Data Science Teams, by DJ Patil; O’Reilly, 2011; amazon.com/dp/B005O4U3ZE
16. Enterprise Data Workflows
[diagram: a Cascading word-count flow, with map (M) and reduce (R) phases marked: Document Collection > Scrub > Tokenize > HashJoin (Left: token split via Regex; RHS: Stop Word List) > GroupBy token > Count > Word Count]
cascading.org
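The word-count flow above can be sketched in plain Python. This is a stand-in for the Cascading pipes, not Cascading's actual API; the stop-word list and the token regex are illustrative:

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "and"}  # illustrative stand-in for the RHS stop-word list

def word_count(documents):
    """Mimic the flow: Scrub -> Tokenize -> stop-word filter -> GroupBy token -> Count."""
    counts = Counter()
    for doc in documents:
        scrubbed = doc.lower()                    # Scrub: normalize case
        tokens = re.findall(r"[a-z']+", scrubbed) # Tokenize via regex
        # HashJoin-style filter against the stop-word list, then GroupBy/Count
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return dict(counts)

docs = ["The quick brown fox", "the fox and the hound"]
print(word_count(docs))  # counts per token, e.g. 'fox' appears twice
```

In Cascading proper, each of these steps would be a pipe assembly wired between source and sink taps, and the GroupBy would shuffle tokens across the cluster.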
17. Enterprise Data Workflows
Over the past 5+ years, we’ve seen many large-scale Enterprise production deployments based on Cascading, Cascalog, Scalding, PyCascading, Cascading.JRuby, etc.
Enterprise data workflows, Machine learning at scale, Big Data…
Why?
amazon.com/dp/1449358721
18. Then, Now, and Ahead
NOW
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
19. Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits well into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
20. Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits well into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
• Adjusted Data – Dr. Don Easterbrook, Senate witness
21. Q3 1997: inflection point
Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware.
This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG.
MapReduce and the Apache Hadoop open source stack emerged from this.
22. Circa 1996: pre-inflection point
[diagram: Stakeholder sends strategy to BI Analysts, who produce Excel pivot tables and PowerPoint slide decks; Product sends requirements to Engineering, which turns SQL Query result sets into optimized code for the Web App; the Web App serves Customers, whose transactions land in the RDBMS]
23. Circa 1996: pre-inflection point
[same diagram as the previous slide, annotated: “Throw it over the wall”]
24. Circa 2001: post big ecommerce successes
[diagram: Stakeholder and Product view dashboards; Algorithmic Modeling builds models, recommenders, and classifiers; Engineering ships servlets and Web Apps with UX for Customers; Middleware handles aggregation of event history; SQL Query result sets, customer transactions, and Logs feed the DW via ETL from the RDBMS]
25. Circa 2001: post big ecommerce successes
[same diagram as the previous slide, annotated: “Data products”]
26. Circa 2013: clusters everywhere
[diagram: Data Products serve Customers; Domain Expert owns the business process and a Workflow dashboard with metrics; Data Scientist handles data science, discovery, and modeling; App Dev builds Web Apps, Mobile, etc. via s/w dev and History services; Planner optimizes taps and capacity for social interactions, transactions, and content; Use Cases run Across Topologies: Hadoop etc. (batch), Log Events, In-Memory Data Grid (near time), and DW, under a Cluster Scheduler run by Ops; introduced capability sits alongside the existing SDLC and RDBMSs]
27. Circa 2013: clusters everywhere
[same diagram as the previous slide, annotated: “Optimizing topologies”]
28. references…
• Lambda Architecture: blending topologies
• Big Data, by Nathan Marz, James Warren; manning.com/marz
source: Nathan Marz
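The “blending topologies” idea can be sketched as a toy serving layer that merges a precomputed batch view with a speed-layer increment. The page names and counts here are hypothetical, not drawn from Marz and Warren’s book:

```python
# Toy Lambda Architecture: answer queries by merging batch and speed layers.
batch_view = {"page_a": 100, "page_b": 40}  # precomputed from the immutable master dataset
speed_view = {"page_a": 3}                  # incremental counts since the last batch run

def query(page):
    # The batch layer gives an eventually-recomputed total;
    # the speed layer fills the gap since the last recompute.
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))  # 103: batch total plus realtime increment
```

When the next batch run completes, its view absorbs the increments and the speed view is discarded, which is what lets the slow, accurate topology and the fast, approximate one blend behind a single query.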
29. references…
Statistical Modeling: The Two Cultures, by Leo Breiman; Statistical Science, 2001; bit.ly/eUTh9L
30. references…
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
31. Then, Now, and Ahead
NOW
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
32. Displacement
Geoffrey Moore
Mohr Davidow Ventures, author of Crossing The Chasm
Hadoop Summit, 2012: what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade; data as the major force… mostly through apps – verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
XLDB, 2012: complex analytics workloads are now displacing SQL as the basis for Enterprise apps
33. Drivers
algorithmic modeling + machine data
+ curation, metadata + Open Data
data products, as feedback into automation
evolution of feedback loops
a big part of the science in data science…
internet of things + complex analytics
accelerated evolution, additional feedback loops
taking this out into a highly social dimension
34. “A kind of Cambrian explosion”
source: National Geographic
36. A Thought Exercise
Consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network.
They will most likely be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment…
Operations Research – crunching amazing amounts of data
37. A Thought Exercise
That’s a $50B company, in a market segment worth $250B.
Upcoming: tractors as drones – guided by complex, distributed data apps
39. Two Avenues to the App Layer
Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments
Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff
[chart axes: complexity ➞, scale ➞]
40. Then, Now, and Ahead
AHEAD
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
41. For instance…
Let’s drill down on that intersection of tractors and crops, as a focus…
Some of the largest use cases for large-scale data workflows which we encounter are in Agriculture.
Here’s a sector which integrates some of those themes from the Internet of Things, Caterpillar, Climate Corp, etc.
42. Data and Agriculture, Ahead
• single largest employer, livelihood for 40% globally
• 500 million small farms worldwide
• most family farmers rely on rain-fed agriculture
• approx $2T agricultural real estate in US alone
• high annual rate of soil depletion
• cycles of flooding, drought, desertification
• high resolution from private satellite networks, e.g., skyboximaging.com
• SMS networks for “business intelligence” among family farmers in Ethiopia, agrepedia.com
• microfinance, e.g., kiva.org, slowmoney.org
43. Data and Agriculture, Ahead
Consider the emerging reality of drone tractors, guided by satellite feeds, with predictive analytics accessing remote cloud-based clusters, crunching data for crops planted per-plot, based on years of history evaluated in time series analysis.
It would be difficult to identify a bigger Big Data problem in the world.
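A minimal sketch of that per-plot time-series idea: fit a linear trend to one plot’s yield history and extrapolate a season ahead. The yield figures are invented for illustration; real systems would use far richer models and data:

```python
# Fit an ordinary least-squares line to (year, yield) pairs for one plot,
# then extrapolate one season past the last observed year.
def linear_forecast(years, yields):
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(yields) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, yields))
             / sum((x - mean_x) ** 2 for x in years))
    intercept = mean_y - slope * mean_x
    return slope * (years[-1] + 1) + intercept

# Hypothetical bushels-per-acre history for a single plot
history = {2009: 148.0, 2010: 152.5, 2011: 151.0, 2012: 158.0}
print(round(linear_forecast(list(history), list(history.values())), 1))  # 159.5
```

Scale that to millions of plots, updated per satellite pass, and the “bigger Big Data problem” claim above starts to look literal.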
44. Data and Agriculture, Ahead
You’ve heard about Peak Oil, Peak Phosphorus? How about Peak Snow?
In other words, rising variance of snow pack levels, increasingly earlier peak snow in the mountains… which stresses the watersheds, infrastructure, etc., which in turn stress agriculture, energy, transportation, financial markets, tax basis, etc.
Jeff Dozier, William Gail, “The Emerging Science of Environmental Applications”, The Fourth Paradigm, 2009
source: J. Dozier, et al., UCSB
45. Data and Agriculture, Ahead
Variance in the timing of the water cycle causes stress on natural resources and infrastructure: reservoirs, aqueducts, river ways, aquifers, levees, farm lands, seawater incursion, etc.
Even in the face of so much IoT data looming, we lack adequate data and modeling of snowpack, snow melt, runoff, evaporation, water basins, etc., to understand the impact of these changes – now needed to forecast where to change infrastructure or strategies.
There’s not much machine data up in the mountain peaks, and satellite data only serves so far… new opportunities for Big Data
source: J. Dozier, et al., UCSB
47. Data and Agriculture, Ahead
We can resolve these kinds of problems; however, solutions must leverage huge amounts of data.
48. Then, Now, and Ahead
AHEAD
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
49. Everything’s Bigger in Texas
Agriculture is just one sector, one set of problems to tackle. We have much, much more here in Texas.
For example, Houston is a major center for Maritime work… check out: marinexplore.org
50. Everything’s Bigger in Texas
There’s also the not so small matter of the Energy and Transportation sectors.
GE is putting sensors in each and every wind generator, each and every jet engine – again, the Internet of Things.
I’ve heard rumors there are a few of those wind turbines out in West Texas?
51. Everything’s Bigger in Texas
Another of the fastest growing use cases we see for large-scale predictive modeling is in Telecom.
Think about the stream of CDRs, billions of us bipeds wandering about the planet with our phones… The firehose for that makes Twitter look like MySpace!
The value of location services as data products for local businesses, communities is astounding.
52. Then, Now, and Ahead
AHEAD
1. Keep Austin Weird?
2. Something Called Data Science
3. Rise Of The Machine Data
4. A Cambrian Explosion
5. Eat, Drink, Be Merry…
6. Data-Driven In TX
7. Roll Up Your Sleeves
53. What is needed?
Approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.
Unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up.
Most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
source: D3
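Two of those skills, programmatic data prep and confidence estimates, can be sketched together in a few lines. The messy records and field values below are hypothetical:

```python
import random
import statistics

# Hypothetical messy feed: whitespace, blanks, and junk mixed with numbers
raw = ["  12.5 ", "n/a", "13.1", "", "12.9", "14.0", "bad", "13.4"]

# Data prep: coerce records to floats, dropping anything that fails to parse.
def clean(records):
    out = []
    for r in records:
        try:
            out.append(float(r.strip()))
        except ValueError:
            pass  # in production you would log or quarantine these
    return out

# Confidence: a simple bootstrap 90% interval for the mean,
# so the reported figure carries an uncertainty range.
def bootstrap_mean_ci(xs, reps=2000, alpha=0.10, seed=1):
    rng = random.Random(seed)
    means = sorted(statistics.fmean(rng.choices(xs, k=len(xs)))
                   for _ in range(reps))
    lo = means[int(reps * alpha / 2)]
    hi = means[int(reps * (1 - alpha / 2)) - 1]
    return lo, hi

values = clean(raw)
print(len(values), bootstrap_mean_ci(values))  # kept-record count and interval
```

Because both steps are scripted, the whole analysis reruns unchanged on next month’s feed, which is exactly the “automate work, making analysis repeatable” point above.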
54. What else do we need?
• more emphasis on statistical thinking
• not SQL vs. NoSQL, but instead a focus on apps as the process of structuring data
• multi-disciplinary teams, not cubicles and silos
• evolving more feedback loops, to drive more automation
• oddly enough, we need automation to be able to employ more people in intelligent, productive ways
• otherwise, we’re left with…
source: Schwa Corporation