1. Big Data and Open Source
Swapnil (Neil) Jadhav
2. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772
Agenda
Introduction
Key strategic challenges for CDOs/CAOs
Key operational challenges for CDOs/CAOs
Top 10 big data tools and technologies
Why open source?
One-page strategy to implement big data programs (Source: Gartner)
Next steps
Introduction
Current : Head of Business Intelligence & Analytics for the City of
Carlsbad
Previously : Neil has provided technical and organizational leadership in
the areas of big data and statistical analysis, database management, data
mining, data architecture, and data warehouse design. He has experience
in various industries.
Organizations
Large consulting firms
Dynamic startup organizations
Fortune 500 companies
Government organizations
Industries
Oil & Gas – BP (formerly British Petroleum)
Hi-Tech – Adobe, Fujitsu
Health & Fitness – Beachbody LLC
FMCG – Cadbury, Australia
State & local government – City of Carlsbad
Key strategic challenges for a CDO/CAO
Identify and communicate the business
context for data within big data analytic
projects
Move from “cool experiments” to driving
business value
Use analytics and information governance
to develop a culture of evidence-based
decision making
Information risk management
Key operational challenges
New technologies require an experimental
approach - it's a learning exercise
Repeatability is the new demand in big data
Getting the right tools and skills in place
Implement self-service data preparation tools that
can accelerate the shift towards business-user-generated
data discovery and advanced analytics
Reduce the time and complexity of preparing data
for analysis
Big Data tools & technologies (non open source)
‘Open Source’ is the new normal
Top 10 Big Data
tools/technologies
#1
1. Apache Spark™: runs programs up to
100x faster than Hadoop MapReduce in
memory, or 10x faster on disk
Developed at UC Berkeley’s Algorithms, Machines and
People Lab (AMPLab) in 2009, open-sourced in 2010, and
later donated to the Apache Software Foundation in 2013
In-memory processing vs. Hadoop’s two-stage, disk-based
MapReduce
IBM will commit $300 million, 3,500 developers, and over
a dozen of its labs worldwide to Spark-related projects
over the next few years
Stable & latest release 1.6, January 4th 2016
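To make the "two-stage map reduce" idea concrete, here is a minimal sketch in plain Python (not Spark or Hadoop code). The function names and sample lines are hypothetical. It only illustrates the shape of the model: a map stage emits key/value pairs and a reduce stage aggregates them. A disk-based engine would write the intermediate pairs to disk between the stages, whereas an in-memory engine keeps them in memory.

```python
from collections import defaultdict

def map_stage(lines):
    """Stage 1: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def reduce_stage(pairs):
    """Stage 2: group the pairs by word and sum their counts."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical input, for illustration only.
lines = ["big data and open source", "open source is the new normal"]

# A disk-based engine would persist this intermediate result between stages;
# here it simply stays in memory.
intermediate = map_stage(lines)
word_counts = reduce_stage(intermediate)
```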
Top 10 Big Data
tools/technologies
#2
2. R
Needs no explanation on why this made it to this list
One of the highest-paid skills
Most-used data science language after SQL
Used by 70% of data miners
Growing faster than any other data science
language
#1 Google Search for Advanced Analytics software
More than 2 million users worldwide
7,829 packages available for use
#1 choice for new graduates
Top 10 Big Data
tools/technologies
#3
3. Talend Open Studio
#1 integration solution to offer GUI support for YARN
2.0
Big data integration without writing code
Real-time statistics for developers to test data jobs
and get immediate statistics
Connect anything, with over 900 connectors with native
support for Hadoop HDFS, HBase, Hive, Pig, Sqoop,
Google BigQuery and NoSQL databases.
Massive scalability by generating native MapReduce, Pig and
Hive code
Top 10 Big Data
tools/technologies
#4
4. Apache Storm – it’s all about real-time
processing!
Storm, a distributed computation framework for event
stream processing, began life as a project of BackType,
a marketing intelligence company bought by Twitter in
2011
Twitter soon open-sourced the project and put it on
GitHub, but Storm ultimately moved to the Apache
Incubator and became an Apache top-level project in
September 2014
Apache Storm is getting ready to take on IoT
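Event stream processing can be sketched in plain Python with generators. This is not the Storm API. In Storm's terminology a "spout" is a source of events and a "bolt" processes them; the function names and sample events below are hypothetical stand-ins for those roles. The point is that each event is handled as it arrives instead of waiting for a complete batch.

```python
from collections import Counter

def event_source(events):
    """Plays the role of a spout: emits events one at a time."""
    for event in events:
        yield event

def running_count(stream):
    """Plays the role of a bolt: updates a per-event count on arrival."""
    counts = Counter()
    for event in stream:
        counts[event] += 1
        yield event, counts[event]

# Hypothetical click-stream events, for illustration only.
events = ["click", "view", "click", "click"]
results = list(running_count(event_source(events)))
```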
Top 10 Big Data
tools/technologies
#5
5. Lumify
Open Source under the Apache 2.0 license
Map Integration that allows users to integrate their preferred
GIS solution
Graph Visualization to analyze relationships, automatically
discover paths between entities, and establish new links in 2D or
3D
Live, Shared Workspaces to organize work into separate
workspaces that users can share with colleagues; updates are
pushed to all users viewing the workspace in real-time
Fine Grained Security to protect data with separate access
controls on entire entities, individual properties, and each
relationship
Top 10 Big Data
tools/technologies
#6
6. Apache Hive
Apache Hive is a data warehouse infrastructure
built on top of Hadoop for providing data
summarization, query, and analysis
Initially developed by Facebook
HiveQL
Execution Environment : MapReduce, Tez, Spark
Data in HDFS or HBase
Data Mining, analytics, machine learning, Ad hoc
Analysis
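HiveQL is SQL-like, so a summarization query looks much like ordinary SQL. The sketch below uses Python's built-in sqlite3 module as a stand-in, not Hive itself. The table name, columns and rows are hypothetical, and SQL dialect details differ between SQLite and Hive, so treat it only as an illustration of the query style.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 120), ("pricing", 45), ("home", 80)],  # hypothetical rows
)

# A typical summarization query: total views per page, largest first.
query = """
    SELECT page, SUM(views) AS total_views
    FROM page_views
    GROUP BY page
    ORDER BY total_views DESC
"""
summary = conn.execute(query).fetchall()
conn.close()
```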
Top 10 Big Data
tools/technologies
#7
7. MongoDB
First developed in 2007 by 10gen (now MongoDB Inc.), the company
shifted to open source in 2009, with MongoDB Inc. offering
commercial support and other services
First choice of NoSQL developers because it’s easy to learn
Not a one-trick pony, balanced approach to support wide
variety of applications
Suitable for OLTP workloads, not necessarily for reporting
style workloads
Simplicity makes it a great start
The most widely adopted document store DB
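A document store keeps records as self-describing, JSON-like documents that do not have to share a fixed schema. The sketch below models that idea with plain Python dictionaries. It is not the MongoDB driver API; the `find` helper and the sample customers are hypothetical.

```python
def find(collection, criteria):
    """Return the documents whose fields equal every key/value in criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]

# Hypothetical documents: each one can carry different fields.
customers = [
    {"name": "Ana", "city": "Carlsbad", "orders": [{"sku": "A1", "qty": 2}]},
    {"name": "Raj", "city": "San Diego"},
    {"name": "Lee", "city": "Carlsbad", "loyalty_tier": "gold"},
]

matches = find(customers, {"city": "Carlsbad"})
names = [doc["name"] for doc in matches]
```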
Top 10 Big Data
tools/technologies
#8
8. Apache Cassandra
Development simplicity (MongoDB) vs. Operational
simplicity (Cassandra)
MongoDB gets credit for an easy out-of-the-box
experience, Cassandra earns full marks for being easy to
manage at scale
Apple is one of the largest production deployments with
over 75,000 nodes storing over 10 PB of data.
Other large Cassandra installations include Netflix (2,500
nodes, 420 TB, over 1 trillion requests per day), Chinese
search engine Easou (270 nodes, 300 TB, over 800 million
requests per day), and eBay (over 100 nodes, 250 TB)
Top 10 Big Data
tools/technologies
#9
9. Apache HBase
A column-oriented key-value store that gets a lot
of use because of its common pedigree with
Hadoop
Highly scalable, modeled after Google’s
Bigtable
Used by Facebook’s messaging platform, LinkedIn,
Sophos, Spotify
Data is readily available to users and
applications via SQL queries (using Cloudera
Impala, Apache Phoenix, or Apache Hive)
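The "column-oriented key-value store" data model can be pictured as a row key that maps to a set of `family:qualifier` columns. The sketch below is plain Python, not the HBase client API. The `put` and `get` helpers, the table contents, and the column names are hypothetical.

```python
from collections import defaultdict

# row key -> {"family:qualifier": value}
table = defaultdict(dict)

def put(row_key, family, qualifier, value):
    """Store one cell under its row key and column (family:qualifier)."""
    table[row_key][f"{family}:{qualifier}"] = value

def get(row_key, family=None):
    """Read a whole row, or only the columns of one column family."""
    row = table.get(row_key, {})
    if family is None:
        return dict(row)
    return {col: val for col, val in row.items() if col.startswith(family + ":")}

# Hypothetical cells, for illustration only.
put("user42", "info", "name", "Neil")
put("user42", "info", "city", "Carlsbad")
put("user42", "stats", "logins", 17)

info = get("user42", "info")
```

Rows can hold different columns, and reading a single column family touches only the cells that belong to it.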
Top 10 Big Data
tools/technologies
#10
10. Your pick!
Benefits of
Open
Source
Ease of access : Instantly accessible, limited budget option,
enables immediate progress
Low investment entry point : Good support and assistance
available from within the developer community through online
forums, chat rooms and developer networks. Low cost of
support and maintenance (e.g. H2O, Dato, Databricks, DataStax)
compared to commercial proprietary vendors
Growing base of skills : Plenty of training is available online
and through meetup groups and seminars, and the community
encourages constant learning and training
Professional satisfaction : Developers are typically
comfortable with, and enjoy using, tools and frameworks to craft
tailor-made analytic solutions. They can participate in and
contribute toward the open source community
Benefits of
Open
Source
Flexibility
Foster analytic agility and avoid vendor lock-in
Innovation : Investment in learning is not wasted,
even if the specific model does not deliver an
immediate outcome
Cutting-edge capabilities : Cutting-edge
approaches, such as new ensemble techniques
and deep learning capabilities, are sometimes
found in open-source solutions years before they
are put into commercial software
Benefits of
Open
Source
Compatibility with open features by commercial
vendors: Many vendors are already incorporating
compatibility with popular open-source languages, interfaces,
analytics libraries and packages, thereby offering more
flexibility to their enterprise analytics platforms. Examples
include Datameer, IBM, Microsoft Azure, Oracle, SAP,
Tableau, Teradata and Tibco Software
Avoiding large IT vendors:
They typically create a large (costly) footprint
Skills training in a special product configuration
becomes increasingly scarce and expensive
If working with a large vendor, you are locked into its
product roadmap
Go “BiModal” – says Gartner!
Combine corporate software with open-source software to be
able to support both bimodal Mode 1 (engineered) and Mode
2 (innovative) approaches
Make investment decisions for advanced analytics capability
based on overall ROI and TCO, not only initial capital
purchase costs
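The difference between purchase cost and total cost of ownership (TCO) can be shown with simple arithmetic. The sketch below assumes a basic model: TCO is the purchase cost plus annual running costs over the period, and ROI is net benefit divided by total cost. All figures and option names are hypothetical and chosen only to show that the option with the lower purchase price is not necessarily the cheaper one overall.

```python
def total_cost_of_ownership(purchase, annual_running, years):
    """Purchase cost plus running costs (support, staff, hosting) over the period."""
    return purchase + annual_running * years

def roi(total_benefit, total_cost):
    """Return on investment as a fraction of total cost."""
    return (total_benefit - total_cost) / total_cost

YEARS = 3
BENEFIT = 570_000  # hypothetical benefit over the period, same for both options

# Hypothetical options: A has no purchase cost but higher running costs.
tco_a = total_cost_of_ownership(purchase=0, annual_running=150_000, years=YEARS)
tco_b = total_cost_of_ownership(purchase=200_000, annual_running=60_000, years=YEARS)

roi_a = roi(BENEFIT, tco_a)
roi_b = roi(BENEFIT, tco_b)
```

With these made-up numbers, option A is cheaper to acquire but costs more over three years, which is why the comparison should be made on TCO and ROI.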
Get
Inspired
Look Outside
Agree Inside
Big data use cases usually center around four types:
1. Operational excellence: Using big data to improve
operations
2. Customer intimacy: Delivering a superior experience,
aka Amazonification
3. Risk management: Mitigating operational, reputational,
financial and strategic risks, including fraud detection
4. New business development: Introducing new products
and services
Get
Going
How do you explain to someone who has never eaten an
orange how it tastes?... It is far easier if you just give them one.
Start with Skills : Build a small team
Try Techniques and Technologies : Be pragmatic about
investments.
Start with the free versions of open-source software (you can
move to a managed version later), and with a straightforward
data lake as the basis for the data.
Use existing hardware or go to the Cloud, and either run the
initiative under the radar, or use a small amount of portfolio
funding to seed a number of experiments.
Anticipate that some of these efforts will lead to no results.
Get
Organized
Create the right architecture - Use the concept
of the "logical data warehouse." In almost all
cases, big data implementations complement the
data warehouse instead of replacing it
Create a governance model
Organizing too early will take forever and
eliminate the experimentation effect. But
implementing governance, and the process of taking
results into production, too late leads to
yet another disconnected stovepipe and hurts
user adoption