Social software generates large volumes of data about the business (who interacts with whom, when, and in what context). However, little of this data is actively used to generate insights that help the business work smarter and faster. This technical session describes how to capture and collect interactions within IBM Connections through its public APIs and apply a variety of analytics, including map/reduce and graph analytics, on a scalable Hadoop platform. This allows us to uncover insights into what the corporate network structure looks like, how information propagates across the organization, how opinions are formed, and how resilient the organization is to attrition.
2. Please Note
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole
discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be
relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver
any material, code or functionality. Information about potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality described for our products remains at our sole
discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment.
The actual throughput or performance that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve
results similar to those stated here.
3. Agenda
A Peek into Data Science
Extracting IBM Connections data for analytical purposes
Analytics And Connections Data
5. What Is This Thing Called Data Science ?
Credit: Rachel Schutt/Cathy O’Neil
6. A Single Coffee Receipt
date        time   cashier  location    size  qty  item   spent
12/10/2013  13:09  Chris    Raleigh500  reg   1    mocha  .80
7. A Year’s Worth Of Coffee Receipts For One Person
date        time   cashier  location    size  qty  item     spent
01/10/2013  13:53  Chris    Raleigh500  reg   1    mocha    .80
01/12/2013  14:02  Doug     Carrabou    reg   1    mocha    .80
01/14/2013  13:09  Nadia    Raleigh500  reg   1    vanilla  .75
02/01/2013  14:02  Nadia    Raleigh500  lg    1    mocha    1.10
03/14/2013  13:14  Chris    Raleigh500  reg   1    blend    .60
04/20/2013  13:32  Nadia    Stardoe     lg    1    mocha    1.10
…
12/14/2013  13:14  Bev      Raleigh500  reg   1    blend    .60
12/20/2013  13:32  Nadia    Winston’s   reg   1    mocha    1.10
Insights
– M-F, 1-2 pm
– 72% Raleigh500
– 75% regular
– 63% mocha
– $.87 avg spending
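The step from receipts to insights can be sketched in a few lines. This is a minimal illustration, not the presenters' method: it uses only the eight receipts legible on the slide, read in row order, so its results will not exactly match the slide's figures, which cover the full year including the elided rows. The function names are made up for this example.

```python
from collections import Counter

# The eight receipts legible on the slide (elided "…" rows are not included).
# Fields: date, time, cashier, location, size, qty, item, spent
RECEIPTS = [
    ("01/10/2013", "13:53", "Chris", "Raleigh500", "reg", 1, "mocha", 0.80),
    ("01/12/2013", "14:02", "Doug", "Carrabou", "reg", 1, "mocha", 0.80),
    ("01/14/2013", "13:09", "Nadia", "Raleigh500", "reg", 1, "vanilla", 0.75),
    ("02/01/2013", "14:02", "Nadia", "Raleigh500", "lg", 1, "mocha", 1.10),
    ("03/14/2013", "13:14", "Chris", "Raleigh500", "reg", 1, "blend", 0.60),
    ("04/20/2013", "13:32", "Nadia", "Stardoe", "lg", 1, "mocha", 1.10),
    ("12/14/2013", "13:14", "Bev", "Raleigh500", "reg", 1, "blend", 0.60),
    ("12/20/2013", "13:32", "Nadia", "Winston's", "reg", 1, "mocha", 1.10),
]


def top_share(values):
    """Return (most common value, its share of all values as a fraction)."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


def insights(receipts):
    """Summarize one person's receipts into a few simple habits."""
    return {
        "location": top_share([r[3] for r in receipts]),
        "size": top_share([r[4] for r in receipts]),
        "item": top_share([r[6] for r in receipts]),
        "avg_spent": sum(r[7] for r in receipts) / len(receipts),
    }


if __name__ == "__main__":
    for name, value in insights(RECEIPTS).items():
        print(name, value)
```

Each insight is just a count or an average over one column, which is the point of the slide: individually trivial records become useful once they are aggregated.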
8. A Year’s Worth Of Coffee Receipts For Many People
[Table: the same receipt layout (date, time, cashier, location, size, qty, item, spent) with an added person column, so that each row also records who made the purchase. People shown: Joel, Toni, Joe, Dan, Dave, Ken, Sally, Joni.]
You get the idea…
9. Business Actions From Insights
From a single transaction (one receipt)
To engaging the customer with relevant actions (many receipts)
– Coupons for food
– Weekend offers?
– Loyalty card?
– Employee rewards?
10. Datafication
“The process of taking all aspects of life and turning them into data”
– Google’s augmented-reality glasses
– Twitter for thoughts
– LinkedIn for professional networks
Credit: Kenneth Cukier/Viktor Mayer-Schoenberger
May/June 2013 Foreign Affairs
http://tinyurl.com/ke6cqku
Today we’ll show you how to add IBM Connections to the list
Creating new products with data, improving existing products with data
11. The Value of Connections ?
Obvious value:
– Collaboration tool
Perhaps “not so obvious” value:
– “Social receipts”: datafication of interaction patterns, leading to business insights!
[Diagram labels: Connections, Analytics, Business Insights]
12. Possible Questions Connections Data Can Help Answer
Are you effectively communicating your message?
Are others responding to your message?
Are customers, business partners, contractors, and employees responding to your message?
Who are the brokers of information in the organization?
Which IBM Connections communities are the most effective?
What are the communication patterns like between divisions?
What are the communication characteristics of high-performing organizations?
Ask Your Question… Find Your Business Value
14. IBM Connections
Profiles: Find the people you need
Forums: Exchange ideas with, and benefit from the expertise of, others
Communities: Work with people who share common roles and expertise
Blogs: Present your own ideas, and learn from others
Files: Post, share, and discover documents, presentations, images, and more
Micro-blogging: Reach out for help from your social network
Wikis: Create web content together
Bookmarks: Save, share, and discover bookmarks
Activities: Organize your work and tap your professional network
Home page: See what's happening across your social network
15. Connections Maximizes The Value of Social Data
IBM Connections provides APIs and SPIs that allow the
value of the social data to be maximized by external
systems:
– ALL Connections data can be accessed by external
systems
– Open, transparent, breaking down silos
Pull data from IBM Connections
– Programmatically access much of the same
information that you can through the IBM Connections
user interface
Have Connections push data to you
– Events for all data changes (create, update, delete: CUD) in all IBM Connections
components can be supplied to external consumers
17. Connections Architecture
[Architecture diagram. Clients (Browser, Mashups, Feed Reader, Sametime, Lotus Notes, Portlets, Microsoft Office, Your App) reach Connections through the HTTP Server & Proxy Cache. The REST API / Atom API accepts GET, POST, PUT, and DELETE, exchanging Atom Feed, Atom Entry, JSON, HTML, HTML Form, and JavaScript. Behind it: IBM Connections Apps; Common Services (Administration via JMX / WSAdmin, Search, Navigational Header, Directory, Person Card); and the back ends (User Directory, RDB, File System).]
18. Connections Architecture
[Architecture diagram. Same elements as the previous slide (clients, HTTP Server & Proxy Cache, REST API / Atom API, IBM Connections Apps, Common Services, User Directory, RDB, File System), with these additions: Event SPI, Integration bus, Other Enterprise Services, and a second Your App.]
19. The Event SPI is the social data fire-hose
Designed to let third parties be notified whenever data changes in any
IBM Connections service
– Real-time events generated by IBM Connections include all
create, update, and delete (CUD) operations
– Potential to represent the complete interaction footprint of the
enterprise
– Lets you capture, persist, model, analyze, visualize, and
monetize your enterprise network
SPI (System Programming Interface) vs API (Application
Programming Interface)
– An SPI sits at a lower level than an API: you contribute Java code at
the system level
– By contributing Java code written to this SPI, third parties can
listen to creation, deletion, and update (and more!) events for
content within IBM Connections
20. Event SPI – Programming aspects
Events: collections of data generated when activities (data-modifying actions, notifications) occur in IBM Connections
– In the SPI, an event is represented by a Java bean /
object
– An event encapsulates data such as the type of action and
the object (and container) involved in the action
Events are delivered to event handlers:
– An event handler is a Java class implemented by a third
party (you!)
– Event handlers are registered in an XML file (events-config.xml)
• The file specifies which types of event to send to a given
handler
– Connections delivers the Java bean representing the event
to the registered event handler(s)
[Diagram: the Event SPI dispatches events to Handler 1, Handler 2, … Handler N, as configured in events-config.xml]
21. Event SPI – available data in each event
blog.entry.created:
“Amy Jones posted a blog entry in the blog named XYZ”
Actor
– The person who initiated this action
– Details: external id, name and, if not disabled, email address
Type
– Type of action
– Example: CREATE, UPDATE, DELETE, NOTIFY, MEMBERSHIP, ...
Item
– General concept for representing an individual entity within a container
– Details: id, name, textual content, HTML and ATOM paths
Container
– General concept for representing a "bucket" or "container" that contains other items
– Details: id, name
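To make the four parts of an event concrete, here is a small Python stand-in for the data an event carries. It is only a sketch: the real SPI delivers a Java bean whose class and accessor names are not shown in these slides, so every class name, field name, id, and path below is hypothetical. The JSON step mirrors the idea of shipping the event to an external system in serialized form.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Actor:
    external_id: str
    name: str
    email: Optional[str] = None  # only present if not disabled


@dataclass
class Container:
    id: str
    name: str


@dataclass
class Item:
    id: str
    name: str
    content: str = ""
    html_path: str = ""
    atom_path: str = ""


@dataclass
class Event:
    name: str  # e.g. "blog.entry.created"
    type: str  # CREATE, UPDATE, DELETE, NOTIFY, MEMBERSHIP, ...
    actor: Actor
    item: Item
    container: Container


def to_json(event):
    """Serialize the whole event, nested parts included, to a JSON string."""
    return json.dumps(asdict(event), sort_keys=True)


# "Amy Jones posted a blog entry in the blog named XYZ"
# (ids, entry title and path are invented for the example)
EXAMPLE = Event(
    name="blog.entry.created",
    type="CREATE",
    actor=Actor(external_id="u-001", name="Amy Jones"),
    item=Item(id="entry-42", name="My first post", html_path="/blogs/xyz/entry-42"),
    container=Container(id="blog-xyz", name="XYZ"),
)

if __name__ == "__main__":
    print(to_json(EXAMPLE))
```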
22. Event SPI – available data in each event
Many more data fields are encapsulated in events:
– Correlation item, set to represent a parent-child relationship (events about a commenting action)
– Target set, which makes it possible to deduce interactions between content and people
– Membership delta field, indicating who has been added to or removed from a community, activity, ...
– ... see the Event SPI documentation for the full list (JavaDoc)
Key point: the event model encapsulates
all the data needed to understand the interaction between people, content and
containers in the platform
23. Event SPI in the context of an analytic solution
Challenges of analytics:
Large incoming event stream
– 100+ CUD events per second
– Growing over the longer term
– Needs a scalable framework for analysis
• Horizontal scaling to address
growth
(Near) real-time indexing
No data loss
24. Taming the fire-hose... (1/2)
Analysis, even basic, is time consuming, thus:
– Analysis should not occur in the event handler, but in an external system
(“Analytics Service”)
– The event handler should not wait until the analytics service has processed the event
• Waiting would cause events to accumulate at the Connections level
• This is a problem because the Connections queue that holds events awaiting delivery to the event
handler has a limited depth
=> Design the event handler to consume and process events as fast as possible, i.e. as
the interface between IBM Connections and an external system
[Diagram: Event SPI → Event Handler → “Data backbone” (storage for asynchronous processing) → Analytics Service]
Goal: retain as many events as possible for further analysis
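The decoupling described on this slide can be illustrated with a producer/consumer sketch. It is a Python analogy only: the real handler is a Java class, and the real data backbone is a durable, distributed system rather than the in-memory queue used here as a stand-in. All names are hypothetical.

```python
import json
import queue
import threading

# In-memory stand-in for the "data backbone". A real backbone would be
# durable and distributed; this queue loses its contents if the process dies.
backbone = queue.Queue()


def handle_event(event):
    """Event handler: serialize and hand off, then return immediately."""
    backbone.put(json.dumps(event))


def analytics_worker(results):
    """Analytics service: consumes events at its own pace."""
    while True:
        message = backbone.get()
        if message is None:  # sentinel: no more events
            backbone.task_done()
            break
        event = json.loads(message)
        results[event["type"]] = results.get(event["type"], 0) + 1
        backbone.task_done()


def run_demo():
    """Push a few events through the handler and count them by type."""
    results = {}
    worker = threading.Thread(target=analytics_worker, args=(results,))
    worker.start()
    for event_type in ["CREATE", "UPDATE", "CREATE", "DELETE"]:
        handle_event({"type": event_type})
    backbone.put(None)
    worker.join()
    return results


if __name__ == "__main__":
    print(run_demo())
```

The handler never calls into the analysis code, so a slow analysis only lengthens the backbone queue and does not hold up event delivery.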
25. Taming the fire-hose... (2/2)
Characteristics of the data backbone
– Distributed and highly available
– Horizontal scale
– High throughput
– Agnostic to consumers' state
Multiple options
– Message broker: MQ / MQTT / ActiveMQ / Apache Kafka
– Database
– ...
26. Integration with a message broker – Apache Kafka
[Code listing not preserved in this transcript; its annotations follow]
– Java class implementing the EventHandler interface
– Sends a JSON representation of the event. Serialization to JSON uses the
open-source Gson library
27. Integration with a message broker – Apache Kafka
Registration – through events-config.xml
[Configuration listing not preserved in this transcript; its annotations follow]
– Java class implementing the EventHandler interface
– Subscriptions define the events delivered by the SPI to the event handler.
Filtered by event name, source (IBM service), and/or type (CREATE,
UPDATE, DELETE, ...)
– Properties: name/value pairs injected into the event handler Java class.
Typically used to pass configuration settings
28. Integration with a message broker – Apache Kafka
Deployment – jar and dependencies made available to the SPI (running in the IBM Connections
News application) through a Shared Library in WebSphere Application Server
29. 3rd party events can also participate in the social analytics
solution
IBM Connections provides OpenSocial
Activity Streams APIs that allow third parties
to push their own events to the Activity
Stream
From Connections 4.5:
– Events pushed through the Activity Stream
APIs are also surfaced in the Event SPI
– An option lets you keep an event out of
the Activity Stream, i.e. surface it only
through the Event SPI
=> A third-party application can also participate in the social analytics graph simply by publishing to
the Connections Activity Stream APIs
30. Pulling data – when is it needed ?
You can “pull” all data from Connections...
but is it really needed?
Good news:
In most cases, events surface all the data needed for analytics purposes (including the content the event is
about)
Events about the same object repeat data
– If there are X events about the same object, the item/correlation data set will always contain the most
up-to-date information about the referenced object
For an analytic solution, in a nutshell, this means that the Event SPI should be sufficient in most cases
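Because each event repeats the current data for the object it refers to, an analytics service can rebuild object state from the stream alone. The sketch below assumes a simplified event shape (a "timestamp", a "type", and an "item" dictionary with an "id"); these field names are hypothetical and are not the SPI's actual ones.

```python
def latest_state(events):
    """Keep, per item id, the item data carried by the most recent event."""
    state = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        item = event["item"]
        if event["type"] == "DELETE":
            state.pop(item["id"], None)  # the object no longer exists
        else:
            state[item["id"]] = item     # later events overwrite earlier ones
    return state


# Hypothetical stream: one entry is created and then renamed,
# another is created and then deleted.
EVENTS = [
    {"timestamp": 1, "type": "CREATE", "item": {"id": "e1", "name": "Draft"}},
    {"timestamp": 3, "type": "UPDATE", "item": {"id": "e1", "name": "Final"}},
    {"timestamp": 2, "type": "CREATE", "item": {"id": "e2", "name": "Temp"}},
    {"timestamp": 4, "type": "DELETE", "item": {"id": "e2", "name": "Temp"}},
]

if __name__ == "__main__":
    print(latest_state(EVENTS))
```

No call back to the REST APIs is needed to learn the new name of the first entry, since the update event already carries it.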
31. Pulling data – when is it needed ?
The “push” approach (Event SPI) is sufficient to build most analytic solutions
– All necessary content (textual content, tags, …) is surfaced in every single event
– All operations that change relationships (i.e. adding/removing a member, colleague, or follower) are surfaced
as events
“Pull” approaches (REST APIs) should stay limited to:
1. “Bootstrap” the Analytics Service based on a Connections system with data existing prior to the
introduction of the event handler used in your analytic solution
• Essentially building membership/network data (as needed)
• Seeding the content should not be needed, as it is repeated whenever an event about the content
is generated
2. Fetching data not available through the Event SPI
• Relatively “rare” for events generated from Connections
32. Pulling data from Connections
Two main approaches for pulling data from Connections
1. REST APIs (Atom / OpenSocial format)
– REST-style HTTP-based APIs (XML, JSON format)
– Transparency: programmatically access much of the same information that can be accessed through
the IBM Connections UI
– “Drink your own champagne”: the public APIs are used internally by plug-ins, mobile … and even by the
web UI of some components (Activity Stream, Activities, …)
2. Seedlist
– Designed to allow a search engine to crawl Connections data for indexing purposes
– Surfaces all content in the system, so it can be of some value for an analytic solution
– HTTP-based APIs (Atom XML format)
34. Authentication aspects for the REST APIs
REST APIs support basic authentication, form-based
authentication and (for most APIs) OAuth
Private data: strict enforcement of access on API calls
– Not very convenient for access by an analytic
system...
“Super user”
– Concept of a “super user”: access control checks on
private data are bypassed
– The “super user” is a user mapped to the JEE
“admin” role across all Connections services
Public data: APIs that access public data don't require
authentication
– Provided that the environment is not configured to
prevent anonymous access
35. Pulling data from Connections – What to use, when?
REST APIs (Atom / OS APIs)
Pros
• Fine granularity: access content / metadata for a specific object / container
• Access to relationship information: APIs are available for fetching
membership lists, network information, who liked a given object, ...
Cons
• Lack of batch retrieval capabilities
Seedlist
Pros
• Batch retrieval of textual content
• Incremental updates (but the Event SPI is much more suitable for this
purpose)
Cons
• Focused around content: does not expose all the data (missing tags,
membership information, ...)
In some very specific cases, data is not available in a form that is easy to consume when building an analytic solution
– Example: getting the list of followers for a given object in the system
– Query the Connections databases directly (in these specific cases only)
– The database schema is private and can change over time
36. Key points
Leverage the Event SPI as much as possible
– Provides (most of) the data needed for any elaborate
analytics solution
– Just let Connections push data to you! It is easier and performs
well
“Fill the gaps” by pulling data from the Atom/Seedlist
APIs
– Initial loading of relationship / content data
– Data not available through the Event SPI
One final warning:
– An analytic solution has access to private data through the Event SPI and the Atom/Seedlist APIs (with the admin
role)
=> Ensure your solution is not leaking private data to unauthorized users
39. The Analytics Data Service
[Architecture diagram. Labels shown: UI service (node.js); identity service; data analytics service; Stream Processing; Workflow coordinator; Web Server; Graph Database; Graph Analytics; pub/sub; Map/Reduce; Tools; Big Table DB; Hadoop/Zookeeper]
40. Frequently Heard Big Data Dimensions
A fuzzy definition:
– 4Vs: volume, velocity, variety, value
– Can’t fit or be processed on a single machine
– Data intensive vs. compute intensive
– Analytics focused
41. Big Data Aspects For Us To Consider
Connections data:
semi-structured, line-formatted output that works well with a Hadoop cluster and with graphs
time and spatial aspects
de-normalized
combined with multiple data sources
calculations = data too
explored for insights, innovate with data
doesn’t ‘expire’, sticky
The difference between “BI” and “Analytics”
– Hadoop environments are designed to interpret the data at processing time
– Processing attributes chosen by the person processing the data
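The idea of interpreting line-formatted data at processing time can be sketched as a tiny map/reduce over JSON lines. This is a single-process Python illustration of the pattern, not Hadoop code; the record fields ("actor", "type") and the sample lines are hypothetical. The person running the job chooses the attribute to group by when the job runs, not when the data is stored.

```python
import json
from collections import defaultdict


def map_phase(lines, key_field):
    """Map: parse each line at processing time and emit (key, 1)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if key_field in record:  # semi-structured: the field may be absent
            yield record[key_field], 1


def reduce_phase(pairs):
    """Reduce: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_by(lines, key_field):
    return reduce_phase(map_phase(lines, key_field))


# Hypothetical line-formatted events, one JSON document per line.
LINES = [
    '{"actor": "amy", "type": "CREATE"}',
    '{"actor": "bob", "type": "UPDATE"}',
    '{"actor": "amy", "type": "CREATE"}',
    '{"type": "DELETE"}',
]

if __name__ == "__main__":
    print(count_by(LINES, "actor"))
    print(count_by(LINES, "type"))
```

The same stored lines answer two different questions simply by changing the key passed at processing time.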
42. ‘Simple’ Analytics Are Often Best
More data usually beats better algorithms
– Lots of data with simple algorithms is not a bad plan.
But you will probably always want to ‘sample’ for efficiency
Credit: Anand Rajaraman, Netflix
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
43. Handling The Data From Connections
Full Refresh
– Often called “bulk load”
Delta Updates
– Streaming via the SPI
What do you do with the data as it comes in ?
– Files ?
– Directly into stores ?
– Directly into analytics ?
A need for real time analytics ?
44. Why A Property Graph In Analytics ?
A property graph has:
– key/value properties
– both vertices and edges can have any number of properties
– directed relationships
– (hint: this is not rdf)
Reference: https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph
We want to answer questions like:
– Context around the event
– Cause and effect of an event
– Things related to an event
Property graphs are a very useful tool
– Data science part
– Production part
[Example graph: (Name: bob) --calls--> (Name: roger)]
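A property graph as defined on this slide needs very little machinery. The following is a minimal in-memory sketch in Python, not the API of any graph database; the class, its method names, and the "count" edge property are invented for illustration.

```python
class PropertyGraph:
    """Vertices and directed, labeled edges, each with key/value properties."""

    def __init__(self):
        self.vertices = {}  # vertex id -> properties
        self.edges = []     # (source id, label, target id, properties)

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = dict(properties)

    def add_edge(self, source, label, target, **properties):
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((source, label, target, dict(properties)))

    def out(self, source, label):
        """Ids of vertices reached from source along edges with this label."""
        return [t for s, l, t, _ in self.edges if s == source and l == label]


def build_example():
    """The slide's example: bob --calls--> roger."""
    graph = PropertyGraph()
    graph.add_vertex(1, name="bob")
    graph.add_vertex(2, name="roger")
    graph.add_edge(1, "calls", 2, count=3)  # "count" is a hypothetical property
    return graph


if __name__ == "__main__":
    g = build_example()
    print([g.vertices[v]["name"] for v in g.out(1, "calls")])
```

Because the edge is directed, following "calls" from roger returns nothing, which is what distinguishes "bob calls roger" from the reverse.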
45. Graph Analytics: A Specific Example For Connections Data
em·i·nence
noun ˈe-mə-nən(t)s
: a condition of being well-known and successful
Source: Merriam-Webster Online
How might we use graph technology in our analytics service to calculate a person’s eminence?
46. Graph Analytics – A Glimpse At Eminence Calculations
A real eminence score can have 13 or more measures just from Connections metadata alone.
[Graph pattern: Person A --creates--> Status Update; Person B --creates--> Comment; Comment --comments on--> Status Update]
Look for this graph pattern, then count comments and weight by who commented, normalize… = an eminence score element
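The pattern match and scoring step can be sketched over plain edge lists. This is one possible reading of "count comments, weight by who commented, normalize", not the presenters' actual formula: the weighting scheme, the exclusion of self-comments, and normalizing by the highest raw score are all assumptions made for this example.

```python
from collections import defaultdict


def eminence_element(creates, comments_on, weights=None):
    """
    creates:     list of (person, content_id) edges
    comments_on: list of (comment_id, status_update_id) edges
    weights:     optional {person: weight}; unlisted people weigh 1.0
    Returns {person: score in (0, 1]} for authors whose status updates
    received comments from other people.
    """
    weights = weights or {}
    author = {content: person for person, content in creates}
    raw = defaultdict(float)
    for comment, update in comments_on:
        commenter = author.get(comment)
        owner = author.get(update)
        # Assumption: skip incomplete patterns and comments on one's own update.
        if commenter is None or owner is None or commenter == owner:
            continue
        raw[owner] += weights.get(commenter, 1.0)
    top = max(raw.values(), default=0.0)
    return {person: score / top for person, score in raw.items()} if top else {}


# Hypothetical data: A's update s1 is commented on by B and C,
# B's update s2 is commented on by A. C's comments weigh double.
CREATES = [("A", "s1"), ("B", "c1"), ("C", "c2"), ("B", "s2"), ("A", "c3")]
COMMENTS_ON = [("c1", "s1"), ("c2", "s1"), ("c3", "s2")]

if __name__ == "__main__":
    print(eminence_element(CREATES, COMMENTS_ON, weights={"C": 2.0}))
```

This yields one element of an eminence score; per the slide, a real score combines many such measures.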
48. Gradually Add More Data and Analytics For Deeper Insights
[Diagram: data sources feeding the analytics: Connections, CRM, Articles, E-mail, Twitter, Other…]
Example: finding potentially obese people… (Source: The Wall Street Journal)
For us:
– What other data is coming in through the Connections Event SPI? (hint: it can be more than just Connections data)
– What other sources of data are there outside of Connections?
49. Summary: Find Business Value In Your Connections Data
From “transactions”/“social receipts”
To insights
Effective use of Connections APIs
Key insights using Big Data Analytics on Connections Data
Engagement for better productivity and faster execution
– at the personal, organizational and company-wide levels
Your insights are limited only by the data and your ability to process it for insights
50. For More Information
Visit IBM’s Emerging Technology Page!
http://www.ibm.com/sna
http://www.ibm.com/engage
Stop by the Innovation center to see more
I’ll be there to answer your specific questions!
More information about the Connections APIs and SPIs in the IBM
Connections product wiki under “Developing”