2. Trigger Warning
This presentation, and materials to which it links, contains triggers.
These will be triggering reactive, asynchronous, and message-driven environments.
A safe room is available in Empire West, where Alan Viars is presenting Modernizing National Health Care.
8. Machine Learning
• What: depends who you ask
– learning that is done by machines [my lab partner]
– algorithms that can learn from and make predictions on data [Wikipedia, just now]
– induction and … other algorithms that can be said to “learn” [Kohavi 1998 goo.gl/WvEmNJ]
– whatever the heck we’re selling [cloud vendors]
– common cognitive framework, ingests content, observe, interpret, evaluate, decide [IBM Watson]
– predictive analytics [Microsoft Azure, AWS]
– algorithmic grab-bag [Mahout, MLlib]
• Why: depends what you want
– Engagement, discovery, decision [Watson]
– Prediction: maintenance, demand, resource allocation [Azure]
– Analytics: fraud, personalization, marketing, churn, support [AWS]
10. Spark MLlib
• Languages: Scala, Java, Python
• Clusters: EC2, YARN, Mesos, standalone
• Linear algebra: Breeze (Scala) / Fortran BLAS
• Data: vector, point, matrix
• Functions
– Basic stats
– Classification and regression
– Collaborative Filtering
– Clustering
– Dimensionality reduction (remove variables)
– Feature extraction & transformation
– Frequent pattern mining
– Optimization (local min/max)
Example: interactive drill-down categories for large result set
11. The Magic of Alternating Least Squares
Latent Factoring
Which is the real me?
Movies recommended for you:
1: The Sound of Music (1965)
2: Snow White and the Seven Dwarfs (1937)
3: Beauty and the Beast (1991)
4: Charlie Brown Christmas, A (1965)
5: Bambi (1942)
6: Seven Brides for Seven Brothers (1954)
7: Mary Poppins (1964)
8: Pinocchio (1940)
9: Gone with the Wind (1939)
10: The Wizard of Oz (1939)
Movies recommended for you:
1: Maradona by Kusturica (2008)
2: Shadows of Forgotten Ancestors (1964)
3: Rosario Tijeras (2005)
4: Constantine's Sword (2007)
5: Titicut Follies (1967)
6: Lady Chatterley (2006)
7: August Evening (2007)
8: Power of Nightmares: The Rise of the Politics of Fear, The (2004)
9: Sun Alley (Sonnenallee) (1999)
10: Who's Singin' Over There? (a.k.a. Who Sings Over There) (Ko to tamo peva) (1980)
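The idea behind latent factoring with alternating least squares can be sketched in a few lines. This is a toy, single-factor version written for illustration, not MLlib's distributed implementation; the function name, the `reg` parameter, and the sample ratings are all assumptions. Each rating is approximated by a user factor times an item factor. With the item factors held fixed, every user factor has a closed-form least-squares solution, and vice versa, so the two steps simply alternate.

```python
import random


def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20, seed=0):
    """Toy rank-1 ALS: approximate ratings[(u, i)] by user_f[u] * item_f[i]."""
    rng = random.Random(seed)
    user_f = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    item_f = [rng.uniform(0.5, 1.5) for _ in range(n_items)]
    for _ in range(iterations):
        # Hold item factors fixed; solve the ridge regression for each user.
        for u in range(n_users):
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(item_f[i] ** 2 for (uu, i), _ in ratings.items() if uu == u)
            user_f[u] = num / (den + reg)
        # Hold user factors fixed; solve the same way for each item.
        for i in range(n_items):
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(user_f[u] ** 2 for (u, ii), _ in ratings.items() if ii == i)
            item_f[i] = num / (den + reg)
    return user_f, item_f


# Hypothetical ratings: user 1 rates everything twice as high as user 0.
# User 0 has not rated item 2; the factors are used to fill in that gap.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (1, 2): 5.0}
user_f, item_f = als_rank1(ratings, n_users=2, n_items=3, reg=0.01, iterations=50)
predicted = user_f[0] * item_f[2]
print("predicted rating for user 0, item 2:", round(predicted, 2))
```

With more than one latent factor per user and item, each closed-form step becomes a small linear solve, which is where the linear algebra libraries come in.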
12. Watson Developer Cloud
• Presented as services for Bluemix
• RESTful calls
• Node.js
• Node-RED
Example: Message resonance for
email solicitation
13. Microsoft Azure
• R and Python
• Flowchart GUI
• Correlation, modeling, trend projection, forecasting
• HDInsight cloud Hadoop
• Publishing for profit via Machine Learning Gallery
– Voice recognition
– Customer churn prediction
– Text extraction: sentiment and key phrase
– Contributor donation propensity
– Frequently bought together
– Classifier
– Clustering
– Linear regression
– … 35 total in market [goo.gl/LhMbUu]
Example: Retail forecasting
15. MongoDB
• Next-gen database
– Document-model
– Scalable
– Highly-available
– Secondary indexes
• Agile with schema and query types
• Subsecond query response over multiple indexes
• Low-second aggregation framework for basic analytics
Example: Number of articles by author
• In-database mapReduce
• Hadoop connector
– Mongo[Input|Output]Format
– mongo.[input|output].uri or BSON
– mongo.input.query
16. MongoDB Data Operations Spectrum
• Retrieve Nothing – infinitely fast
• Document Retrieval – 1ms if in cache, ~10ms from spinning disk
• .find() – per-document cost similar to single document
– _id range
– any secondary index range, can be composite key
– intersect two indexes
– covered indexes even faster
• .count(), .distinct(), .group() – fast, may be covered
• .aggregate() – retrieval cost like find, plus pipeline operations
– $match
– $group
– $project
– $redact
• .mapReduce() – in-database JavaScript
• Hadoop Connector
– mongo.input.query for indexed partial scan
– full scan
Operations are listed from faster (top) to slower (bottom).
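As a sketch of what an .aggregate() pipeline looks like, here is a $match followed by a $group that counts articles per author. The documents and field names (`author`, `status`) are hypothetical. The pipeline is the list that would be handed to the driver; the function below only mimics it in memory so the example runs without a database.

```python
from collections import Counter

# Hypothetical sample documents; the field names are assumptions.
articles = [
    {"author": "alice", "status": "published"},
    {"author": "alice", "status": "published"},
    {"author": "alice", "status": "draft"},
    {"author": "bob", "status": "published"},
]

# The pipeline as it could be passed to a driver, for example
# db.articles.aggregate(pipeline) with pymongo, assuming a collection
# named "articles" exists.
pipeline = [
    {"$match": {"status": "published"}},
    {"$group": {"_id": "$author", "count": {"$sum": 1}}},
]


def count_by_author(docs):
    """In-memory stand-in for the pipeline above, for illustration only."""
    matched = (d for d in docs if d.get("status") == "published")  # $match
    counts = Counter(d["author"] for d in matched)  # $group with $sum: 1
    return [{"_id": author, "count": n} for author, n in counts.items()]


for row in count_by_author(articles):
    print(row)
```

Putting $match first matters: it is the stage that can use an index, so the later stages only see the documents that survive it.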
19. Topic Detection
• Grouping documents according to topics, especially over time
– Google News
• Latent Dirichlet Allocation
– Corpus of M documents, each of N words
wij is the word at position j in document i
– Documents have (latent) topic distributions: θi for document i, with Dirichlet prior α
– Topics have word distributions: φk for topic k, with Dirichlet prior β
zij is the topic contributing the word at position j in document i
– Remove stopwords!
• Tweets
– Large, terse corpus
– Highly sensitive to number of iterations (10 returned little more than word distribution)
– Requires some iterative stopwording
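The smoothed LDA model above can be written as a generative process. This is a sketch using the slide's symbols, taking w_ij as the word at position j of document i; K, the number of topics, is an assumption since the slide does not name it.

```latex
\begin{align*}
\theta_i  &\sim \operatorname{Dirichlet}(\alpha), & i &= 1,\dots,M \\
\varphi_k &\sim \operatorname{Dirichlet}(\beta),  & k &= 1,\dots,K \\
z_{ij}    &\sim \operatorname{Categorical}(\theta_i), & j &= 1,\dots,N \\
w_{ij}    &\sim \operatorname{Categorical}(\varphi_{z_{ij}})
\end{align*}
```

Only the words w_ij are observed. Inference runs this process backwards to recover the per-document topic mixes and the per-topic word distributions, which is why frequent stopwords left in the corpus tend to dominate every topic.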
"Smoothed LDA" by Slxu.public - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Smoothed_LDA.png#/media/File:Smoothed_LDA.png
"Dirichlet distributions" by en:User:ThG - en:Image:Dirichlet_distributions.png. Licensed under Public
Domain via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Dirichlet_distributions.png#/media/File:Dirichlet_distributions.png
20.
*
*           Form  C := alpha*A**H*B + beta*C.
*
              DO 120 J = 1,N
                  DO 110 I = 1,M
                      TEMP = ZERO
                      DO 100 L = 1,K
                          TEMP = TEMP + CONJG(A(L,I))*B(L,J)
  100                 CONTINUE
                      IF (BETA.EQ.ZERO) THEN
                          C(I,J) = ALPHA*TEMP
                      ELSE
                          C(I,J) = ALPHA*TEMP + BETA*C(I,J)
                      END IF
  110             CONTINUE
  120         CONTINUE
          ELSE
*
*           Form  C := alpha*A**T*B + beta*C
*
              DO 150 J = 1,N
                  DO 140 I = 1,M
                      TEMP = ZERO
                      DO 130 L = 1,K
                          TEMP = TEMP + A(L,I)*B(L,J)
  130                 CONTINUE
                      IF (BETA.EQ.ZERO) THEN
                          C(I,J) = ALPHA*TEMP
                      ELSE
                          C(I,J) = ALPHA*TEMP + BETA*C(I,J)
                      END IF
  140             CONTINUE
  150         CONTINUE
          END IF
      ELSE IF (NOTA) THEN
          IF (CONJB) THEN
*
*           Form  C := alpha*A*B**H + beta*C.
*
              DO 200 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 160 I = 1,M
                          C(I,J) = ZERO
  160                 CONTINUE
                  ELSE IF (BETA.NE.ONE) THEN
                      DO 170 I = 1,M
                          C(I,J) = BETA*C(I,J)
  170                 CONTINUE
                  END IF
                  DO 190 L = 1,K
                      IF (B(J,L).NE.ZERO) THEN
                          TEMP = ALPHA*CONJG(B(J,L))
                          DO 180 I = 1,M
                              C(I,J) = C(I,J) + TEMP*A(I,L)
  180                     CONTINUE
                      END IF
  190             CONTINUE
  200         CONTINUE
          ELSE
*
*           Form  C := alpha*A*B**T + beta*C
*
              DO 250 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 210 I = 1,M
                          C(I,J) = ZERO
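For readers who do not speak Fortran, the `C := alpha*A**T*B + beta*C` branch of the excerpt above can be sketched in Python. This mirrors the loop structure for real-valued matrices held as lists of rows; the function name and the sample values are made up for illustration, and it makes no attempt at BLAS-level performance.

```python
def gemm_at_b(alpha, a, b, beta, c):
    """Update c in place with alpha * A^T * B + beta * C.

    a is K x M, b is K x N and c is M x N, each a list of rows.
    """
    k_dim, m_dim, n_dim = len(a), len(c), len(c[0])
    for j in range(n_dim):
        for i in range(m_dim):
            temp = 0.0
            for l in range(k_dim):
                # A is read as A(L, I): the transpose is implied by the indexing.
                temp += a[l][i] * b[l][j]
            if beta == 0:
                # As in the Fortran, C is not read at all when beta is zero.
                c[i][j] = alpha * temp
            else:
                c[i][j] = alpha * temp + beta * c[i][j]
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(gemm_at_b(2.0, a, b, 1.0, c))
```

The separate branch for a zero beta is not only an optimization: it lets the caller pass an uninitialized C without stray values leaking into the result.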
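22. Operate through Spark on the RDD Object
import org.bson.BasicBSONObject

// Roll each group of one-minute bars up into a single five-minute bar.
// groupBars is defined elsewhere; each group g is assumed to be time-ordered.
val fiveMinBars = groupBars.map(
  g => (
    // Key: the _id of the first bar in the group.
    g.head.get("_id"),
    // Value: start from the first bar, then overwrite the rolled-up fields.
    new BasicBSONObject(g.head.toMap()).
      append("Close", g.last.get("Close") ).
      append("High", g.map(b => b.get("High").toString.toFloat).reduceLeft(math.max) ).
      append("Low", g.map(b => b.get("Low").toString.toFloat).reduceLeft(math.min) ).
      append("Volume", g.map(b => b.get("Volume").toString.toInt).foldLeft(0)(_ + _) )
  )
)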
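The same roll-up, without Spark or BSON, can be sketched on plain dictionaries. The slide does not show how the groups are built, so slicing a time-ordered list into runs of five is an assumption here, as are the function names and the sample prices.

```python
def roll_up(bars):
    """Collapse a time-ordered list of one-minute bars into a single bar."""
    combined = dict(bars[0])  # keeps _id, Open and any other first-bar fields
    combined["Close"] = bars[-1]["Close"]
    combined["High"] = max(b["High"] for b in bars)
    combined["Low"] = min(b["Low"] for b in bars)
    combined["Volume"] = sum(b["Volume"] for b in bars)
    return combined


def five_minute_bars(minute_bars, size=5):
    """Group consecutive bars into runs of `size` and roll each run up."""
    groups = [minute_bars[i:i + size] for i in range(0, len(minute_bars), size)]
    return [roll_up(g) for g in groups]


# Hypothetical one-minute bars.
sample = [
    {"_id": 0, "Open": 10.0, "High": 10.5, "Low": 9.8, "Close": 10.2, "Volume": 100},
    {"_id": 1, "Open": 10.2, "High": 10.9, "Low": 10.1, "Close": 10.8, "Volume": 150},
    {"_id": 2, "Open": 10.8, "High": 11.0, "Low": 10.4, "Close": 10.5, "Volume": 120},
    {"_id": 3, "Open": 10.5, "High": 10.6, "Low": 9.9, "Close": 10.0, "Volume": 200},
    {"_id": 4, "Open": 10.0, "High": 10.3, "Low": 9.7, "Close": 10.1, "Volume": 80},
]
print(five_minute_bars(sample))
```

Open and _id come from the first bar, Close from the last, and High, Low and Volume are reductions over the whole group, matching the shape of the Scala above.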
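23. Put It Back Where You Found It
import org.apache.hadoop.conf.Configuration
import com.mongodb.hadoop.MongoOutputFormat

// Create a separate Configuration for saving data back to MongoDB.
val outputConfig = new Configuration()
outputConfig.set("mongo.output.format", "com.mongodb.hadoop.MongoOutputFormat")
// mongoPort is defined elsewhere; it presumably holds the host:port part of the URI.
outputConfig.set("mongo.output.uri", "mongodb://"
  + mongoPort
  + "/marketdata.fiveminutebars")
// The file path is a placeholder; the data goes to the collection named in mongo.output.uri.
fiveMinBars.saveAsNewAPIHadoopFile(
  "file:///dummy",
  classOf[Any],
  classOf[Any],
  classOf[MongoOutputFormat[_,_]],
  outputConfig)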
28. Code/Highlight Example
{
  _id : ObjectId("4c4ba5e5e8aabf3"),
  employee_name: "Dunham, Justin",
  department : "Marketing",
  title : "Product Manager, Web",
  report_up: "Neray, Graham",
  pay_band: "C",
  benefits : [
    { type : "Health",
      plan : "PPO Plus" },
    { type : "Dental",
      plan : "Standard" }
  ]
}