Big Data and Open Source
Swapnil (Neil) Jadhav
Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772
Agenda
• Introduction
• Key strategic challenges for CDOs/CAOs
• Key operational challenges for CDOs/CAOs
• Top 10 big data tools and technologies
• Why open source?
• One-page strategy to implement big data programs (Source: Gartner)
• Next steps
Introduction
• Current: Head of Business Intelligence & Analytics for the City of Carlsbad
• Previously: Neil has provided technical and organizational leadership in the areas of big data and statistical analysis, database management, data mining, data architecture, and data warehouse design. He has experience in various industries.
Organizations
• Large consulting firms
• Dynamic startup organizations
• Fortune 500 companies
• Government organizations
Industries
• Oil & Gas – BP (formerly British Petroleum)
• Hi-Tech – Adobe, Fujitsu
• Health & Fitness – Beachbody LLC
• FMCG – Cadbury, Australia
• State & local government – City of Carlsbad
Key strategic challenges for a CDO/CAO
• Identify and communicate the business context for data within big data analytics projects
• Move from “cool experiments” to driving business value
• Use analytics and information governance to develop a culture of evidence-based decision making
• Information risk management
Key operational challenges
• New technologies require an experimental approach: it's a learning exercise
• Repeatability is the new demand in big data
• Getting the right tools and skills in place
• Implement self-service data preparation tools that can accelerate the shift towards business-user-generated data discovery and advanced analytics
• Reduce the time and complexity of preparing data for analysis
Big Data tools & technologies (non-open-source)
‘Open Source’ is the new normal
Top 10 Big Data tools/technologies #1
1. Apache Spark™: runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Developed at UC Berkeley’s Algorithms, Machines and People Lab (AMPLab) in 2009, open-sourced in 2010, and donated to the Apache Software Foundation in 2013
• In-memory processing vs. Hadoop’s two-stage, disk-based MapReduce
• IBM will invest $300 million and commit 3,500 developers and over a dozen of its labs worldwide to Spark-related projects over the next few years
• Latest stable release: 1.6 (January 4, 2016)
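The two-stage model that Spark is compared against can be sketched in a few lines of plain Python. This is a toy, single-process illustration, not Spark or Hadoop code; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical. In Hadoop MapReduce the intermediate data between the stages goes to disk, whereas Spark can keep intermediate datasets in memory, which is where the speed-up claimed on the slide comes from.

```python
from collections import defaultdict


def map_phase(lines):
    """Map stage: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key (in Hadoop MapReduce this intermediate data hits disk)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce stage: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Chain the stages: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(lines)))


lines = ["big data and open source", "open source is the new normal"]
counts = word_count(lines)
print(counts["open"], counts["source"], counts["data"])  # 2 2 1
```

An iterative job (for example, a machine learning algorithm) repeats this chain many times, so avoiding a disk round trip per iteration matters more than it does for a single pass.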
Top 10 Big Data tools/technologies #2
2. R
• Needs no explanation for why it made this list
• One of the highest-paid skills
• Most-used data science language after SQL
• Used by 70% of data miners
• Growing faster than any other data science language
• #1 in Google searches for advanced analytics software
• More than 2 million users worldwide
• 7,829 packages available for use
• #1 choice for new graduates
Top 10 Big Data tools/technologies #3
3. Talend Open Studio
• #1 integration solution to offer GUI support for YARN 2.0
• Big data integration without writing code
• Real-time statistics let developers test data jobs and get immediate feedback
• Connect anything: over 900 connectors, with native support for Hadoop HDFS, HBase, Hive, Pig, Sqoop, Google BigQuery, and NoSQL databases
• Massive scalability by generating MapReduce, Pig, and Hive code
Top 10 Big Data tools/technologies #4
4. Apache Storm – it’s all about real-time processing!
• Storm, a distributed computation framework for event stream processing, began life as a project of BackType, a marketing intelligence company bought by Twitter in 2011
• Twitter soon open-sourced the project and put it on GitHub, but Storm ultimately moved to the Apache Incubator and became an Apache top-level project in September 2014
• Apache Storm is getting ready to take on IoT
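Event stream processing means each event is handled as it arrives instead of waiting for a batch. Storm calls its stream sources "spouts" and its processing steps "bolts"; the sketch below imitates that shape with plain Python generators. It is not Storm code, and the names `spout`, `extract_hashtags`, and `rolling_count` are hypothetical.

```python
from collections import Counter


def spout(events):
    """Source: hand events downstream one at a time."""
    for event in events:
        yield event


def extract_hashtags(stream):
    """First processing step: pull lowercase hashtags out of each message."""
    for text in stream:
        for token in text.split():
            if token.startswith("#"):
                yield token.lower()


def rolling_count(stream):
    """Second processing step: emit the running count after every hashtag."""
    counts = Counter()
    for tag in stream:
        counts[tag] += 1
        yield tag, counts[tag]


events = ["#BigData is here", "love #bigdata and #Spark"]
updates = list(rolling_count(extract_hashtags(spout(events))))
print(updates)  # [('#bigdata', 1), ('#bigdata', 2), ('#spark', 1)]
```

Because every stage is a generator, a result is available after each event rather than at the end of the input, which is the property that matters for real-time use cases.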
Top 10 Big Data tools/technologies #5
5. Lumify
• Open source under the Apache 2.0 license
• Map integration that allows users to integrate their preferred GIS solution
• Graph visualization to analyze relationships, automatically discover paths between entities, and establish new links in 2D or 3D
• Live, shared workspaces to organize work into separate workspaces that users can share with colleagues; updates are pushed to all users viewing the workspace in real time
• Fine-grained security to protect data with separate access controls on entire entities, individual properties, and each relationship
Top 10 Big Data tools/technologies #6
6. Apache Hive
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
• Initially developed by Facebook
• Query language: HiveQL
• Execution environments: MapReduce, Tez, Spark
• Data in HDFS or HBase
• Data mining, analytics, machine learning, ad hoc analysis
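HiveQL is SQL-like, so the "data summarization" bullet can be illustrated with an ordinary GROUP BY query. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so that it runs anywhere; it is not Hive, and the `page_views` table and its rows are made up for illustration. Hive would accept a similar statement but execute it as distributed jobs over data in HDFS or HBase.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("US", 120), ("US", 80), ("AU", 50), ("IN", 70), ("IN", 30)],
)

# Summarize: row count and total views per country.
summary = conn.execute(
    "SELECT country, COUNT(*) AS n_rows, SUM(views) AS total_views "
    "FROM page_views GROUP BY country ORDER BY country"
).fetchall()
print(summary)  # [('AU', 1, 50), ('IN', 2, 100), ('US', 2, 200)]
```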
NoSQL Databases skills index
Top 10 Big Data tools/technologies #7
7. MongoDB
• First developed in 2007 by 10gen (now MongoDB Inc.); the company shifted to open source in 2009, offering commercial support and other services
• First choice of NoSQL developers because it’s easy to learn
• Not a one-trick pony: a balanced approach that supports a wide variety of applications
• Suitable for OLTP workloads, not necessarily for reporting-style workloads
• Simplicity makes it a great start
• The most widely adopted document store DB
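A document store keeps each record as a self-contained, JSON-like document, and documents in the same collection do not have to share a schema. The sketch below models that idea with plain Python dictionaries. It is not MongoDB or its driver; the `orders` data and the `find` helper are hypothetical and only mimic an equality query on top-level fields.

```python
import json

# Three documents in one "collection"; note that their fields differ.
orders = [
    {"_id": 1, "customer": "Ada", "status": "shipped",
     "items": [{"sku": "A1", "qty": 2}]},
    {"_id": 2, "customer": "Lin", "status": "open",
     "items": [{"sku": "B7", "qty": 1}, {"sku": "A1", "qty": 4}]},
    {"_id": 3, "customer": "Ada", "status": "open", "note": "gift wrap"},
]


def find(collection, query):
    """Return documents whose top-level fields equal every key/value in query."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in query.items())
    ]


open_orders = find(orders, {"status": "open"})
print([doc["_id"] for doc in open_orders])  # [2, 3]
print(json.dumps(orders[2]))  # each document serializes on its own
```

Nesting the line items inside the order document is what makes this model convenient for application developers: one read returns the whole order.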
Top 10 Big Data tools/technologies #8
8. Apache Cassandra
• Development simplicity (MongoDB) vs. operational simplicity (Cassandra)
• MongoDB gets credit for an easy out-of-the-box experience; Cassandra earns full marks for being easy to manage at scale
• Apple is one of the largest production deployments, with over 75,000 nodes storing over 10 PB of data
• Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB)
Top 10 Big Data tools/technologies #9
9. Apache HBase
• A column-oriented key-value store that gets a lot of use because of its common pedigree with Hadoop
• Highly scalable, modeled after Google’s Bigtable
• Used by the Facebook messaging platform, LinkedIn, Sophos, and Spotify
• Data is readily available to users and applications via SQL queries (using Cloudera Impala, Apache Phoenix, or Apache Hive)
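The "column-oriented key-value store" data model can be pictured as a sorted map from row key to a sparse set of `family:qualifier` columns. The sketch below is a toy in-memory model of that idea, not the HBase client API; the `ToyTable` class, its methods, and the sample keys are hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> {"family:qualifier": value}, scanned in key order."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell; rows may have completely different columns."""
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Fetch one cell, or None if the row or column is absent."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start, stop):
        """Yield rows with start <= row key < stop, in sorted key order."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key, dict(self._rows[key])


table = ToyTable()
table.put("user#001", "profile:name", "Ada")
table.put("user#001", "activity:last_login", "2016-01-04")
table.put("user#002", "profile:name", "Lin")
table.put("video#001", "profile:title", "Intro")

print(table.get("user#001", "profile:name"))               # Ada
print([key for key, _ in table.scan("user#", "user#~")])   # ['user#001', 'user#002']
```

Because rows are kept in key order, choosing a row-key prefix such as `user#` turns "all users" into a single range scan, which is why row-key design matters so much in this kind of store.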
Top 10 Big Data tools/technologies #10
10. Your pick!
Benefits of Open Source
• Ease of access: instantly accessible, a limited-budget option, enables immediate progress
• Low investment entry point: good support and assistance are available from the developer community through online forums, chat rooms, and developer networks. Support and maintenance from vendors such as H2O, Dato, Databricks, and DataStax cost less than from commercial proprietary vendors
• Growing base of skills: plenty of online training, meetup groups, and seminars; the community encourages constant learning and training
• Professional satisfaction: developers are typically comfortable with, and enjoy using, tools and frameworks to craft tailor-made analytic solutions. They can participate in and contribute to the open source community
Benefits of Open Source
• Flexibility
  ◦ Foster analytic agility and avoid vendor lock-in
• Innovation: investment in learning is not wasted, even if the specific model does not deliver an immediate outcome
• Cutting-edge capabilities: cutting-edge approaches, such as new ensemble techniques and deep learning capabilities, are sometimes found in open-source solutions years before they are put into commercial software
Benefits of Open Source
• Compatibility with open features by commercial vendors: many vendors are already incorporating compatibility with popular open-source languages, interfaces, analytics libraries, and packages, thereby offering more flexibility in their enterprise analytics platforms. Examples include Datameer, IBM, Microsoft Azure, Oracle, SAP, Tableau, Teradata, and Tibco Software
• Avoiding large IT vendors:
  ◦ They typically create a large (costly) footprint
  ◦ Skills training in a special product configuration becomes increasingly scarce and expensive
  ◦ If working with a large vendor, you are locked into its product roadmap
Go “Bimodal” – says Gartner!
• Combine corporate software with open-source software to support both Mode 1 (engineered) and Mode 2 (innovative) of a bimodal approach
• Make investment decisions for advanced analytics capability based on overall ROI and TCO, not only initial capital purchase costs
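The advice to judge investments on overall ROI and TCO rather than purchase price comes down to simple arithmetic, sketched below. All figures are invented for illustration and are not from the talk or from Gartner; the function names and the two options are hypothetical. TCO is taken here as upfront cost plus running cost over the period, and ROI as net benefit divided by total cost.

```python
def total_cost_of_ownership(upfront, annual_running, years):
    """Upfront purchase cost plus running costs over the whole period."""
    return upfront + annual_running * years


def roi(total_benefit, total_cost):
    """Return on investment as a fraction: net benefit divided by cost."""
    return (total_benefit - total_cost) / total_cost


years = 3
benefit = 600_000  # hypothetical business benefit over the period

# Option A: large license fee, lower yearly running cost (made-up numbers).
tco_a = total_cost_of_ownership(upfront=300_000, annual_running=60_000, years=years)
# Option B: no license fee, higher yearly staffing and support (made-up numbers).
tco_b = total_cost_of_ownership(upfront=0, annual_running=150_000, years=years)

print(tco_a, tco_b)                                 # 480000 450000
print(f"{roi(benefit, tco_a):.2f}", f"{roi(benefit, tco_b):.2f}")  # 0.25 0.33
```

With these numbers the option that costs nothing upfront is only slightly cheaper over three years, so the initial purchase price alone would have overstated the difference.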
Quick one-page strategy
Get Inspired
• Look outside
• Agree inside
Big data use cases usually center on four types:
1. Operational excellence: using big data to improve operations
2. Customer intimacy: delivering a superior experience, aka Amazonification
3. Risk management: mitigating operational, reputational, financial, and strategic risks, including fraud detection
4. New business development: introducing new products and services
Get Going
• How do you explain the taste of an orange to someone who has never eaten one? It is far easier to just give them one.
• Start with skills: build a small team
• Try techniques and technologies: be pragmatic about investments
  ◦ Start with the free versions of open-source software (you can move to a managed version later), and with a straightforward data lake as the basis for the data
  ◦ Use existing hardware or go to the cloud, and either run the initiative under the radar or use a small amount of portfolio funding to seed a number of experiments
  ◦ Anticipate that some of these efforts will lead to no results
Get Organized
• Create the right architecture: use the concept of the "logical data warehouse." In almost all cases, big data implementations complement the data warehouse instead of replacing it
• Create a governance model
  ◦ Organizing too early takes forever and eliminates the experimentation effect. But introducing governance, and the process of taking results into production, too late leads to yet another disconnected stovepipe and hurts user adoption
Next Steps
Questions?
Connect with me!
Email: jadhavswapnil@gmail.com
Cell: 408.636.3772
LinkedIn: https://www.linkedin.com/in/jadhavswapnil


Big Data & Open Source - Neil Jadhav

  • 1. Big Data and Open Source ▸ ‐ Swapnil (Neil) Jadhav
  • 2. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772 Agenda  Introduction  Key strategic challenges for CDOs/CAOs  Key operational challenges for CDOs/CAOs  Top 10 big data tools and technologies  Why open source?  1 page strategy to implement big data programs (Source: Gartner)  Next steps
  • 3. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772 Introduction  Current : Head of Business Intelligence & Analytics for the City of Carlsbad  Previously : Neil has provided technical and organizational leadership in the areas of big data and statistical analysis, database management, data mining, data architecture, and data warehouse design. He has experience in various industries. Organizations Industries  Large consulting firms  Dynamic startup organizations  Fortune 500 companies  Government organizations  Oil & Gas – BP (formerly British Petroleum)  Hi-Tech – Adobe, Fujitsu  Health & Fitness – Beachbody LLC  FMCG – Cadbury, Australia  State & local government – City of Carlsbad
  • 4. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772 Key strategic challenges for a CDO/CAO  Identify and communicate the business context for data within big data analytic projects  Move from “cool experiments” to driving business value  Use analytics and information governance to develop a culture of evidence-based decision making  Information risk management
  • 5. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772 Key operational challenges  New technologies require an experimental approach - it's a learning exercise  Repeatability is the new demand in big data  Getting the right tools and skills in place  Implement self-service data preparation tools that can accelerate the shift towards business-user- generated data discovery and advanced analytics  Reduce the time and complexity of preparing data for analysis
  • 6. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772 Big Data tools & technologies (non open source)
  • 7. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772 ‘Open Source’ is the new normal
  • 8. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772 Top 10 Big Data tools/technologies #1 1. Apache Spark™- Runs programs up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk  Developed at UC Berkeley’s Algorithms, Machines and People Lab (AMPLab) in 2009, later donated to Apache in 2010  In-memory vs. Hadoop’s two stage disk based map reduce  IBM will invest $300 Million, 3500 developers, and over a dozen of its labs worldwide to spark-related projects over the next few years  Stable & latest release 1.6, January 4th 2016
  • 9. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772 Top 10 BigData tools/technologies #2 2. R  Needs no explanation on why this made it to this list  One of the highest paid skill  Most-used data science language after SQL  Used by 70% of data miners  Growing faster than any other data science language  #1 Google Search for Advanced Analytics software  More than 2 million users worldwide  7,829 packages available for use  #1 choice for new graduates
  • 10. Swapnil (Neil) Jadhav | jadhavswapnil@gmail.com | 408.636.3772 Top 10 Big Data tools/technologies #3 3. Talend Open Studio  #1 integration solution to offer GUI support for YARN 2.0  Big data integration without writing code  Real-time statistics for developers to test data jobs and get immediate statistics  Connect anything, with over 900 connectors with native support for Hadoop HDFS, HBase, Hive, Pig, Sqoop, Google BigQuery and NoSQL databases.  Massive scalability that offers MapReduce, Pig and Hive code
  • 11. Top 10 Big Data tools/technologies #4: Apache Storm (it’s all about real-time processing!)
      • Storm, a distributed computation framework for event stream processing, began life as a project of BackType, a marketing intelligence company bought by Twitter in 2011
      • Twitter soon open-sourced the project and put it on GitHub; Storm later moved to the Apache Incubator and became an Apache top-level project in September 2014
      • Apache Storm is getting ready to take on IoT
  • 12. Top 10 Big Data tools/technologies #5: Lumify
      • Open source under the Apache 2.0 license
      • Map integration: lets users integrate their preferred GIS solution
      • Graph visualization: analyze relationships, automatically discover paths between entities, and establish new links in 2D or 3D
      • Live, shared workspaces: organize work into separate workspaces that users can share with colleagues; updates are pushed in real time to all users viewing the workspace
      • Fine-grained security: protect data with separate access controls on entire entities, individual properties, and each relationship
  • 13. Top 10 Big Data tools/technologies #6: Apache Hive
      • Apache Hive is a data warehouse infrastructure built on top of Hadoop for data summarization, query, and analysis
      • Initially developed by Facebook
      • Query language: HiveQL
      • Execution engines: MapReduce, Tez, Spark
      • Data in HDFS or HBase
      • Data mining, analytics, machine learning, ad hoc analysis
  • 14. NoSQL Databases skills index
  • 15. Top 10 Big Data tools/technologies #7: MongoDB
      • First developed in 2007 by 10gen (now MongoDB Inc.); the company shifted to open source in 2009 and offers commercial support and other services
      • First choice of NoSQL developers because it’s easy to learn
      • Not a one-trick pony: a balanced approach that supports a wide variety of applications
      • Suitable for OLTP workloads, not necessarily for reporting-style workloads
      • Simplicity makes it a great start
      • The most widely adopted document store database
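As a rough illustration of what "document store" means, the sketch below models a record as one self-contained, nested document and filters a collection by field value. It is plain Python, not the MongoDB driver API; the `matches` helper, the sample customers, and all field names other than MongoDB's conventional `_id` key are assumptions made for the example.

```python
import json

# A hypothetical customer record stored as a single nested document,
# rather than as rows spread across several relational tables.
customer = {
    "_id": 1,
    "name": "Ada",
    "addresses": [{"city": "Carlsbad", "state": "CA"}],
    "orders": [{"sku": "A-100", "qty": 2}, {"sku": "B-200", "qty": 1}],
}


def matches(document, criteria):
    """Return True when every top-level field in criteria equals the document's value."""
    return all(document.get(field) == value for field, value in criteria.items())


# Documents in one collection do not need to share the same fields.
collection = [customer, {"_id": 2, "name": "Grace", "orders": []}]
found = [doc["_id"] for doc in collection if matches(doc, {"name": "Ada"})]

# Documents map naturally onto JSON, which is part of why the model is easy to learn.
serialized = json.dumps(customer)
```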
  • 16. Top 10 Big Data tools/technologies #8: Apache Cassandra
      • Development simplicity (MongoDB) vs. operational simplicity (Cassandra)
      • MongoDB gets credit for an easy out-of-the-box experience; Cassandra earns full marks for being easy to manage at scale
      • Apple runs one of the largest production deployments, with over 75,000 nodes storing over 10 PB of data
      • Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB)
  • 17. Top 10 Big Data tools/technologies #9: Apache HBase
      • A column-oriented key-value store that gets a lot of use because of its common pedigree with Hadoop
      • Highly scalable, modeled after Google’s Bigtable
      • Users include Facebook (messaging platform), LinkedIn, Sophos and Spotify
      • Data is readily available to users and applications via SQL queries (using Cloudera Impala, Apache Phoenix, or Apache Hive)
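The Bigtable-style data model behind a column-oriented key-value store can be sketched as a nested map: row key, then column (written here as `family:qualifier`), then timestamped versions of a value. The toy class below is an in-memory illustration of that shape only; `TinyColumnStore` and its methods are hypothetical names, not the HBase client API.

```python
from collections import defaultdict


class TinyColumnStore:
    """Toy model: row key -> 'family:qualifier' -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        """Store one versioned cell value."""
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column):
        """Return the most recent value for a cell, or None if it is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key


store = TinyColumnStore()
store.put("user#001", "msg:subject", "hello", timestamp=1)
store.put("user#001", "msg:subject", "hello again", timestamp=2)
store.put("user#002", "msg:subject", "hi", timestamp=1)

latest = store.get("user#001", "msg:subject")
rows = list(store.scan("user#001", "user#002"))
```

Keeping rows sorted by key is what makes range scans cheap in this family of stores, so row-key design usually matters more than it does in a relational schema.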
  • 18. Top 10 Big Data tools/technologies #10: Your pick!
  • 19. Benefits of Open Source
      • Ease of access: instantly accessible, an option for limited budgets, enables immediate progress
      • Low investment entry point: good support and assistance are available from the developer community through online forums, chat rooms and developer networks. Support and maintenance from vendors such as H2O, Dato, Databricks and DataStax cost less than from commercial proprietary vendors
      • Growing base of skills: plenty of online training, meetup groups and seminars, and a community that encourages constant learning
      • Professional satisfaction: developers are typically comfortable with, and enjoy using, tools and frameworks to craft tailor-made analytic solutions. They can participate in and contribute to the open source community
  • 20. Benefits of Open Source
      • Flexibility: foster analytic agility and avoid vendor lock-in
      • Innovation: investment in learning is not wasted, even if the specific model does not deliver an immediate outcome
      • Cutting-edge capabilities: new approaches, such as new ensemble techniques and deep learning capabilities, are sometimes found in open-source solutions years before they appear in commercial software
  • 21. Benefits of Open Source
      • Compatibility with open features by commercial vendors: many vendors are already incorporating compatibility with popular open-source languages, interfaces, analytics libraries and packages, thereby offering more flexibility in their enterprise analytics platforms. Examples include Datameer, IBM, Microsoft Azure, Oracle, SAP, Tableau, Teradata and Tibco Software
      • Avoiding large IT vendors:
          • They typically create a large (costly) footprint
          • Skills training in a special product configuration becomes increasingly scarce and expensive
          • If working with a large vendor, you are locked into its product roadmap
  • 22. Go “bimodal”, says Gartner!
      • Combine corporate software with open-source software to support both bimodal Mode 1 (engineered) and Mode 2 (innovative) approaches
      • Make investment decisions for advanced analytics capability based on overall ROI and TCO, not only initial capital purchase costs
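The point about comparing on TCO and ROI rather than purchase price can be shown with simple arithmetic. The sketch below is a deliberately simplified model (upfront cost plus flat annual costs, no discounting), and every figure in it is hypothetical and chosen only to illustrate the calculation.

```python
def total_cost_of_ownership(upfront, annual_costs, years):
    """Upfront purchase cost plus recurring costs over the evaluation period."""
    return upfront + annual_costs * years


def roi(total_benefit, total_cost):
    """Return on investment, expressed as a fraction of total cost."""
    return (total_benefit - total_cost) / total_cost


# Hypothetical figures, for illustration only.
years = 3
benefit = 600_000

# Licensed product: large purchase price, lower annual support and staffing.
proprietary_tco = total_cost_of_ownership(upfront=300_000, annual_costs=60_000, years=years)

# Open-source stack: no purchase price, but higher annual staffing and support.
open_source_tco = total_cost_of_ownership(upfront=0, annual_costs=140_000, years=years)

proprietary_roi = roi(benefit, proprietary_tco)
open_source_roi = roi(benefit, open_source_tco)
```

With these made-up numbers the purchase prices differ by 300,000 while the three-year TCOs differ by only 60,000, which is why the initial capital cost alone is a poor basis for the decision.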
  • 23. Quick 1 page strategy
  • 24. Get Inspired
      • Look outside
      • Agree inside
      • Big data use cases usually center around four types:
          1. Operational excellence: using big data to improve operations
          2. Customer intimacy: delivering a superior experience, aka Amazonification
          3. Risk management: mitigating operational, reputational, financial and strategic risks, including fraud detection
          4. New business development: introducing new products and services
  • 25. Get Going
      • How do you explain to someone who has never eaten an orange how it tastes? It is far easier if you just give them one.
      • Start with skills: build a small team
      • Try techniques and technologies: be pragmatic about investments
      • Start with the free versions of open-source software (you can move to a managed version later), and with a straightforward data lake as the basis for the data
      • Use existing hardware or go to the cloud, and either run the initiative under the radar or use a small amount of portfolio funding to seed a number of experiments
      • Anticipate that some of these efforts will lead to no results
  • 26. Get Organized
      • Create the right architecture: use the concept of the "logical data warehouse." In almost all cases, big data implementations complement the data warehouse instead of replacing it
      • Create a governance model
      • Organizing too early takes forever and eliminates the experimentation effect. But implementing governance, and the process of taking results into production, too late leads to yet another disconnected stovepipe and hurts user adoption
  • 27. Next Steps
  • 28. Questions? Connect with me!
      • Email: jadhavswapnil@gmail.com
      • Cell: 408.636.3772
      • LinkedIn: https://www.linkedin.com/in/jadhavswapnil