This document summarizes a presentation given by Al Nevarez and Sally Sadosky of LinkedIn on how the company uses market research and big data analytics. It discusses LinkedIn's business goals and vision, how it conducts market research through surveys, and how it analyzes massive amounts of member data using tools like Hadoop and Pig to gain insights at low cost. Integrating survey data with behavioral data through SQL joins allows answering questions about member segments and experiences.
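As a rough illustration of the last point, joining survey answers to behavioral data on a shared member key might look like the sketch below. The table names (`survey_responses`, `member_activity`), the columns, and the sample rows are all hypothetical, and SQLite stands in for whatever system the presenters actually used; none of this comes from the presentation itself.

```python
import sqlite3


def satisfaction_by_segment(conn):
    """Average survey score per behavioral segment (hypothetical schema)."""
    query = """
        SELECT b.segment, AVG(s.score) AS avg_score, COUNT(*) AS n
        FROM survey_responses AS s
        JOIN member_activity AS b ON b.member_id = s.member_id
        GROUP BY b.segment
        ORDER BY b.segment
    """
    return conn.execute(query).fetchall()


# Made-up sample data: three survey answers, four members with activity records.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE survey_responses (member_id INTEGER, score INTEGER);
    CREATE TABLE member_activity (member_id INTEGER, segment TEXT);
    INSERT INTO survey_responses VALUES (1, 9), (2, 6), (3, 8);
    INSERT INTO member_activity VALUES
        (1, 'daily'), (2, 'monthly'), (3, 'daily'), (4, 'monthly');
""")

rows = satisfaction_by_segment(conn)
for segment, avg_score, n in rows:
    print(segment, avg_score, n)
```

Because this is an inner join, members who never answered the survey (member 4 here) drop out of the result, which is usually what you want when comparing stated experience across behavioral segments.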
This is a presentation about big data with Java. The slides explain why big data is so important and introduce some of the tools used to build big data applications, such as Apache Hadoop, Apache Spark, and Apache Kafka.
Five Critical Success Factors for Big Data and Traditional BI (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and VelociData
Live Webcast Dec. 10, 2013
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=7909837&rKey=b0bac7d09bf1a638
Most Big Data discussions focus on analytics, but business users need more than that. They need speed, because most opportunities these days are transient and must be acted on quickly. Bottlenecks in the delivery of analytic results often occur on the gathering and transformation side, where massive volumes of data must be validated, converted, masked or otherwise transformed before hitting the analytics engine. Big Data is rapidly overrunning conventional approaches, creating requirements for accelerated, hybrid systems.
Register for this episode of the Briefing Room to hear veteran IT Analyst Dr. Robin Bloor, as he explains how a combination of innovations is dramatically changing how companies can solve serious data transformation challenges. Robin will be briefed by Ron Indeck of VelociData, who will tout their record-breaking data operations appliance. He'll also discuss five critical success factors for achieving optimal performance, including the necessary infrastructure for executing data transformations at wire speed.
Visit InsideAnalysis.com for more information
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth (HostedbyConfluent)
"For data-driven enterprises, the most important objective is unlocking the value of their data. To enable this, data scientists are increasingly turning towards data discovery tools (also known as data catalogs) that can help them locate the right dataset or insight and use it correctly. But are all data catalogs the same?
In this talk, I describe how a stream-first architecture was a critical design element that benefited the implementation of our data catalog. We follow the evolution of LinkedIn DataHub’s architecture over the past few years from a simple search tool to a streaming metadata platform that drives productivity and governance workflows across the company.
Join this talk to learn:
* How different data discovery / catalog tools are architected and the tradeoffs in each kind of architecture
* How streaming architectures can benefit metadata
* How event-driven metadata architectures can supercharge your data productivity and governance workflows at your company"
Software Analytics with Jupyter, Pandas, jQAssistant, and Neo4j [Neo4j Online... (Markus Harrer)
Let’s tackle problems in software development in an automated, data-driven and reproducible way!
As developers, we often feel that there might be something wrong with the way we develop software. Unfortunately, a gut feeling alone isn’t sufficient for the complex, interconnected problems in software systems.
We need solid, understandable arguments to gain budgets for improvement projects or to defend ourselves against political decisions. But we can help ourselves: every step in the development or use of software leaves valuable digital traces. With clever analysis, this data can show us the root causes of problems in our software and deliver new insights – understandable for everybody.
If concrete problems and their impact are known, developers and managers can create solutions and take sustainable actions aligned to existing business goals.
In this meetup, I talk about the analysis of software data by using a digital notebook approach. This allows you to express your gut feelings explicitly with the help of hypotheses, explorations and visualizations step by step.
I show the collaboration of open source analysis tools (Jupyter, Pandas, jQAssistant and, of course, Neo4j) to inspect problems in Java applications and their environment. We have a look at performance hotspots, knowledge loss and worthless code parts – completely automated from raw data up to visualizations for management.
Participants learn how data analysis can turn their uncertain gut feelings into solid evidence for obtaining budgets for dedicated improvement projects.
This presentation was given at one of the DSATL Meetups in March 2018 in partnership with Southern Data Science Conference 2018 (www.southerndatascience.com)
Keynote "#DBInsights" on 7 April. My views on DBAs' fears, doubts and opportunities in the age of DevOps, Cloud, Big Data, Open Source, bi-modal IT, Pizza teams, you name it.
The New Frontier: Optimizing Big Data Exploration (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and Cirro
Live Webcast on February 11, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=0ec1fa381886313cc06d841015c65898
As information ecosystems continue to expand, businesses are searching for ways to combine traditional analytics with a new source of insight: Big Data. But with data flooding in from all kinds of sources, fast access and performance at scale can easily become an issue. One effective approach for solving this challenge is data federation, a method that involves taking the analytical processing to the data, allowing streamlined access to multiple data sources without the expensive ETL overhead or building of semantic layers.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor as he explains how the prevalence of distributed data calls for a new approach to Big Data. He will be briefed by Mark Theissen of Cirro, who will tout his company’s Data Hub, a data federation solution that provides a single point of access to all enterprise data assets without excessive data movements, preprocessing or staging. He will discuss how data federation differs from virtualization and ETL approaches, and demonstrate how a Cirro deployment solves the analytics challenge of integrating data silos across the data center – and the cloud – using the BI tools you already have on your desktop for real-time distributed analytics.
Visit InsideAnalysis.com for more information.
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be... (Simplilearn)
This presentation about Hadoop will help you learn the basics of Hadoop and its components. First, you will see what Big Data is and the significant challenges it poses. Then, you will understand how Hadoop solved those challenges. You will have a glance at the history of Hadoop, what Hadoop is, the different companies using Hadoop, the applications of Hadoop in different companies, etc. Finally, you will learn the three essential components of Hadoop – HDFS, MapReduce, and YARN, along with their architecture. Now, let us get started with Introduction to Hadoop.
The following topics are explained in this Hadoop presentation:
1. Big Data and its challenges
2. Hadoop as a solution
3. History of Hadoop
4. What is Hadoop
5. Applications of Hadoop
6. Components of Hadoop
7. Hadoop Distributed File System
8. Hadoop MapReduce
9. Hadoop YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create databases and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schemas, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, Flume sinks, channels, and Flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/introduction-to-big-data-and-hadoop-certification-training.
The Briefing Room with William McKnight and Actian
Live Webcast on October 14, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=135528d85baa96a07850bd35961d459d
Integrating Hadoop with existing data sources, workflows and analytics can be a real challenge. While some components, like Hive and Spark, can give SQL access to Hadoop data, there isn’t much that enables Hadoop to be treated as a genuine BI and analytics platform, capable of running multiple jobs that serve multiple users and multiple applications. But what if you could turn Hadoop into a versatile, high performance development platform, forgoing all the pain of figuring out how and where to manage big data?
Register for this episode of The Briefing Room to hear veteran Analyst William McKnight as he discusses the fairly swift evolution of Hadoop’s capabilities. He’ll be briefed by Jim Hare of Actian, who will tout his company’s latest addition to its Analytic Platform: Hadoop SQL Edition. He will show how Actian has leveraged Hadoop and its scale out file system to create a fully functioning platform, providing everything from an analytic database to machine learning.
Visit InsideAnalysis.com for more information.
Applications in R - Success and Lessons Learned from the Marketplace (Revolution Analytics)
Adoption of the R language has grown rapidly in the last few years, and R is ranked as the number-one data science language in several surveys. This accelerating adoption curve has been driven by the Big Data revolution, and by the fact that so many data scientists — having learned R at university — are actively unlocking the secrets hidden in these new, vast data troves.
In this webinar David Smith, Chief Community Officer, will take a look at the growth of R and the innovative uses of R in business, government and non-profit sectors. Then Neera Talbert, Vice President, Professional Services will take you into the trenches of recent customer deployments and share best practices and pitfalls to avoid in deploying or expanding your own R applications.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... (Agile Testing Alliance)
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was given at #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.
Agile Big Data Analytics Development: An Architecture-Centric Approach (SoftServe)
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
Agile Data Science is a lean methodology that is adopted from Agile Software Development. At the core it centers around people, interactions, and building minimally viable products to ship fast and often to solicit customer feedback. In this presentation, I describe how this work was done in the past with examples. Get started today with our help by visiting http://www.alpinenow.com
Big Data in Action – Real-World Solution Showcase (Inside Analysis)
The Briefing Room with Radiant Advisors and IBM
Live Webcast on February 25, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=53c9b7fa2000f98f5b236747e3602511
The power of Big Data depends heavily upon the context in which it's used, and most organizations are just beginning to figure out where, how and when to leverage it. One key to success is integration with existing information systems, many of which still rely on relational database technologies. Finding ways to blend these two worlds can help companies generate measurable business value in fairly short order.
Register for this episode of The Briefing Room to hear Analysts Lindy Ryan and John O'Brien as they explain how the combination of traditional Business Intelligence with Big Data Analytics can provide game-changing results in today's information economy. They'll be briefed by Eric Poulin and Paul Flach of Stream Integration who will share best practices for designing and implementing Big Data solutions. They'll discuss the components of IBM BigInsights, and explain how BigSheets can empower non-technical users who need to explore self-structured data.
Visit InsideAnalysis.com for more information.
In this slide deck, Infochimps Director of Product, Tim Gasper, discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days, sometimes in just hours. Tim explains how Infochimps is now taking that same aggressive approach to deliver faster time to value by helping customers develop analytic applications quickly.
Big Data Tools: A Deep Dive into Essential Tools (FredReynolds2)
Today, practically every firm uses big data to gain a competitive advantage in the market. With this in mind, freely available big data tools for analysis and processing are a cost-effective and beneficial choice for enterprises. Hadoop is the sector’s leading open-source initiative and big data tidal roller. Moreover, this is not the final chapter! Numerous other businesses pursue Hadoop’s free and open-source path.
Empowering your Enterprise with a Self-Service Data Marketplace (EMEA) (Denodo)
Watch full webinar here: https://bit.ly/3aWI8lt
Self-service is a major goal of modern data strategists. A successfully implemented self-service initiative means that business users have access to holistic and consistent views of data regardless of its location, source or type. As data unification and data collaboration become key critical success factors for organisations, data catalogs play a key role as the perfect companion for a virtual layer to fully empower those self-service initiatives and build a self-service data marketplace requiring minimal IT intervention.
Denodo’s Data Catalog is a key piece in Denodo’s portfolio to bridge the gap between the technical data infrastructure and business users. It provides documentation, search, governance and collaboration capabilities, and data exploration wizards. It provides business users with the tool to generate their own insights with proper security, governance, and guardrails.
In this session we will cover:
- The role of a virtual semantic layer in self-service initiatives
- Key ingredients of a successful self-service data marketplace
- Self-service (consumption) vs. inventory catalogs
- Best practices and advanced tips for successful deployment
- A Demonstration: Product Demo
- Examples of customers using Denodo’s Data Catalog to enable self-service initiatives
Empowering your Enterprise with a Self-Service Data Marketplace (ASEAN) (Denodo)
Watch full webinar here: https://bit.ly/3uqcAN0
Self-service is a major goal of modern data strategists. A successfully implemented self-service initiative means that business users have access to holistic and consistent views of data regardless of its location, source or type. As data unification and data collaboration become key critical success factors for organizations, data catalogs play a key role as the perfect companion for a virtual layer to fully empower those self-service initiatives and build a self-service data marketplace requiring minimal IT intervention.
Denodo’s Data Catalog is a key piece in Denodo’s portfolio to bridge the gap between the technical data infrastructure and business users. It provides documentation, search, governance and collaboration capabilities, and data exploration wizards. It provides business users with the tool to generate their own insights with proper security, governance, and guardrails.
In this session we will cover:
- The role of a virtual semantic layer in self-service initiatives
- Key ingredients of a successful self-service data marketplace
- Self-service (consumption) vs. inventory catalogs
- Best practices and advanced tips for successful deployment
- A Demonstration: Product Demo
- Examples of customers using Denodo’s Data Catalog to enable self-service initiatives
Speaker: Franz Walder, Product Manager, panagenda
Abstract: panagenda reached out to 750+ professionals to share their company’s Domino application strategy. Join this session to find out what was most important to your peers and what challenges they had to overcome to make their project a success. Find out about the critical questions everybody should ask and have answers to throughout their project. Franz Walder presents the exciting results of the survey and explains what role analytics can play when tackling these challenges.
Real-time analytics is a beautiful thing, especially if you can build it in a quick, scalable and robust way. We built a digital command center for our marketing team, which provided real-time analytics on social media, clickstream and Google search terms, in the span of a couple of months. This solution was built entirely on open source technologies, using a combination of Apache NiFi, Elasticsearch and Hadoop. Simple but very effective. In this presentation I would like to share the architecture, lessons learned and business benefits of this solution.
RNUG 2020: Domino Application Strategy: Key insights for successful moderniza... (panagenda)
panagenda reached out to 750+ professionals to share their company’s Domino application strategy. Join this session to find out what was most important to your peers and what challenges they had to overcome to make their project a success. Find out about the critical questions everybody should ask and have answers to throughout their project. Franz Walder presents the exciting results of the survey and explains what role analytics can play when tackling these challenges.
Learn How Financial Services Organizations Can Use Big Data to Mitigate RisksMapR Technologies
Risk comes in a variety of forms including uncertainty in financial markets, legal liabilities, operational risk, fraud, and protection against external and internal attacks. Models are becoming increasingly granular and improving risk modeling is a high priority.
Review this presentation from Splunk and MapR to learn how you can study months’ or years’ worth of raw data from disparate sources, without sampling, to understand and reduce risk.
Presentation to Analytics Network of the OR Society Nov 2020Paul Laughlin
Presentation on 'The Softer Skills that Analysts need' presented by Paul Laughlin at a virtual event run for the Analytics Network group within the UK OR Society. Exploring Paul's 9 Step Model for effective analysis & explaining how Softer Skills are essential throughout that workflow.
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Greg Makowski
Describing a predictive data mining model can provide a competitive advantage for solving business problems with a model. The SSA approach can also provide reasons for the forecast for each record. This can help drive investigations into fields and interactions during a data mining project, as well as identifying "data drift" between the original training data, and the current scoring data. I am working on open source version of SSA, first in R.
Similar to Market Research Meets Big Data Analytics for Business Transformation (20)
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Market Research Meets Big Data Analytics for Business Transformation
1. Al Nevarez
Senior Manager, Business Analytics
LinkedIn
Sally Sadosky
Group Manager, Market Research
LinkedIn
Market Research Meets Big Data Analytics
for Business Transformation
The Market Research Conference
Orlando, FL
Nov 2-4, 2015
2. Agenda
1. Linkedin’s Business
2. Market Research & Customer Feedback at LinkedIn
3. Market Research Big Data
4. Big Data: talent, tools & process at Linkedin for MR
5. Low cost per answer with modern ETL (Extract, Transform, Load)
6. The value is in the JOIN
7. Reporting
8. Analysis: Traditional & Modern techniques
9. The Big Picture
9. Power the majority
of the world’s hires
Identify & engage
professionals with
relevant content
Social selling.
Transform cold
calls into warm
prospects
Hire Market Sell
Share content,
find, contact, and
learn more about
people at your
company
@Work
For our clients
10. At LinkedIn, we believe in:
1. Delivers on a singular value proposition in a world class way
2. Simple, intuitive and anticipates needs
3. Exceed expectations
4. Emotionally resonate
5. Change the user’s life for the better
12. NPS as a Measure of Loyalty
Post Launch
Tracking and
Evaluation
Member
Empathy
Opportunity
Identification and
Exploration
Idea Generation
Concept Definition
Product Definition
User Experience
and Usability
Go To Market
Product Launch
Post Launch
Tracking and
Evaluation
13.
How likely are you to
recommend LinkedIn to a
friend or a colleague?
NPS
14.
Area of Focus
Known to Self
Unknown to Others
Open
Hidden
Known to Linkedin Unknown to Linkedin
Known to Members
Unknown to Members
Discovery
Unknown
16. • 2000 completes per month per country
• Daily email sends
• Representative sample: # of visits per 90 days
• Members are kept anonymous
• Mobile ready
• In local language
• Results weighted by country
LinkedIn’s NPS and CSAT program
19
Top 9
Countries
17. Questionnaire Design
• Set a competitive context
• social networking, jobs sites, content
• NPS for each selected site
• Open-end about NPS rating
• CSAT product questions for LinkedIn
• Emotional driver questions for LinkedIn
• Open-end on what LinkedIn can do better
• Key demographics
• Re-contact permission ask
• Behavioral data appends (pre-pop)
23. Research Analysis Teams at Linkedin
1. Market research analysts
2. Business Analytics Data Scientists Al
24. Talent
Solutions
Marketing
Solutions
100 team members support 9000+ employees
Sales
Solutions
Premium
Subscriptions
Consumer
Marketing
Business Analytics
Business Operations & Analytics
CFO
CEO
Where is Business Analytics
in Linkedin’s organization ?
Market
Research
25.
26. Insights
What is the best
that could happen?
Intelligence
What will happen?
Information/Knowledge
Why did it happen?
Data
What happened?
Business ROI
Business analytics evolution: from data to transformation
Transformation & Change
Implement & monitor
27. Business models
Marketing, Sales, Recruiting
Targeting & Attribution
Customer experience
Communication/interpersonal skills
Statistics
Probability
Optimization
Modeling
Numerical analysis
Simulations
Analytics
A-B Testing
SQL, ETL, APIs,
relational database,
graph database,
software engineering,
tool building, web
applications, R, Python,
Data visualization,
data mining,
Machine Learning
Hadoop,
Spark, Hive, Pig
The business analytics staff - Complete Data Scientists
Business
Knowledge
Outcome = Data products
which many staff can leverage
29. Big Data Technical Themes
1. Efficient: Move the computation to the data
2. Shared foundation to build on with open source
3. Scalability (storage 1/10th the price of traditional)
4. Scalability (grow to multiple – thousands –
of processors with little cost)
5. Reliability (replicated data, failure survival)
6. Schema on read (save all data in raw form, NoSQL)
30. Components of Hadoop
3 areas
1. Data Storage HDFS: a network OS for the data, replication
2. Map reduce: Efficiently spreads the work
3. Hadoop libraries: Hive, Hbase, Pig….
32. Big Data Tools We Use Regularly at
Hadoop
Hive
Pig
Low cost storage
Unstructured data
Highly scalable processing
SQL-like query
Query Hadoop data
Massive result sets
Advanced processing
Advanced ETL
Data Flows
33. Map Reduce
Example: average a billion #s
Distribute to 1000 nodes > Get sum & count at each node >
Sum the sums and sum the counts > at end sum of sums / total counts
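The averaging example can be sketched in plain Python, without Hadoop. This is only an illustration under stated assumptions: the function names (`split_into_chunks`, `map_chunk`, `reduce_partials`) are hypothetical, and four small chunks stand in for the 1000 nodes and the billion numbers.

```python
from typing import Iterable, List, Tuple


def split_into_chunks(values: List[float], n_chunks: int) -> List[List[float]]:
    """Deal the values out to n_chunks lists, standing in for cluster nodes."""
    return [values[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk: Iterable[float]) -> Tuple[float, int]:
    """Map step: each node returns only its local (sum, count)."""
    total, count = 0.0, 0
    for value in chunk:
        total += value
        count += 1
    return total, count


def reduce_partials(partials: Iterable[Tuple[float, int]]) -> float:
    """Reduce step: sum the sums, sum the counts, then divide once."""
    grand_total, grand_count = 0.0, 0
    for total, count in partials:
        grand_total += total
        grand_count += count
    if grand_count == 0:
        raise ValueError("cannot average zero values")
    return grand_total / grand_count


numbers = [float(n) for n in range(1, 101)]  # 1..100, a small stand-in
partials = [map_chunk(chunk) for chunk in split_into_chunks(numbers, 4)]
average = reduce_partials(partials)
```

Each node reports a sum and a count rather than its own average, because averaging the per-node averages is only correct when every node holds the same number of values.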
37. Sampling Data Workflow for Survey Research
Members &
Clients use:
Flagship Desktop
Mobile Apps
Talent solutions
Marketing solutions
Sales solutions
Application
Data storage
(Engineering)
ETL to DWH
(Data Services)
400mil members
• Sign ins
• Profile edits
• Language setting
• Product registrations
• Searches
• Publishing
Profile summaries
Aggregated data
Usage & Engagement
levels (daily visits)
Member segments
Survey history
Survey pre-pop data
Sample for
non-survey studies
Sample for
survey studies
SQL processes
Automated, some manual
Global
Daily, monthly or quarterly
Sampling strategy adjustments
Survey pre-pop data
Snapshot tables
SQL
(Marketing Operations)
Survey vendor
Snapshot
39. Some member data is anonymously passed (or obfuscated and
passed) to the survey vendor with the invitation list to support:
1. Survey branching
2. Survey quota management
3. Survey language
4. Light reporting on survey vendor’s reporting platform
Pass through or pre-pop
Field count: dozen or so
40. In addition to pre-pop data passed to the survey vendor,
internally we store “snapshot” values about each survey invitee.
1. Maintains a snapshot of the member’s full profile at the time
of survey
2. Private & internal to Linkedin
3. Used for internal NPS (general BI) analysis & dashboards
4. Used for data mining & pattern discovery
5. Used by many departments to understand members/clients’
activity at time of survey
6. Slice and dice by anything that comes up
7. Key = member id
Snapshot Profile Data
Field count: Hundreds
41. ETL Process for Low Cost Per Answer
from your survey results
42. ETL Process Before Big Data
Survey Vendor Data
Survey program A
Survey program B
Survey program C
Survey program D
Survey program E
Multiple Relational
Database Tables
Survey Table A
Survey Table B
Survey Table C
Survey Table D
Survey Table E
What if Survey B adds 5 questions and drops 3 questions ?
$ $ $ $
Schema A
Schema B
Schema C
Schema D
Schema E
43. ETL Process After Big Data
Survey Vendor
Survey program A
Survey program B
Survey program C
Survey program D
Survey program E
1 Simple relational
database table
… with just the data
we need for analysis and
dashboards
But ALL the data fully
available on Hadoop
for other studies
$
$
Schema
HDFS
44. Survey document
storage on HDFS
Record 1:
{
"record" : 8695,
"uuid" : "zzcxgtz2m0ahuzf2",
"date" : 1434475680000,
"start_date" : 1434475020000,
"customer_id" : "abd123",
"survey_fields" : {
"Q1_NPS" : "10",
"Q6_Driversr1" : "11",
"Q6_Driversr2" : "7",
"Q6_Driversr3" : "8",
"Q7_Productsatr1" : "8",
"Q7_Productsatr2" : "9",
"Q7_Productsatr3" : "10",
"wave" : 1,
"country" : 1,
"is_mobile" : 1,
"mobileos" : 3,
"verbatim1": "Love Linkedin!",
"status" : 3
}
}
Schema
An example survey record (condensed)
Core key values are those that exist for
every survey record.
Under “survey_fields” we have the
survey-specific fields.
DWH team only stores this.
They may be very different between
survey programs, and may change
for a given survey program. DWH
team doesn’t care.
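A minimal Python sketch of this schema-on-read idea, under stated assumptions: the list of core keys is inferred from the example record, the helper names (`read_survey_record`, `get_answer`) are hypothetical, and the second record and its `Q9_NewQuestion` field are invented to show a survey that added a question.

```python
import json

# Core keys assumed (from the example record) to exist on every survey record.
CORE_KEYS = ("record", "uuid", "date", "start_date", "customer_id")


def read_survey_record(line: str) -> dict:
    """Parse one JSON survey document and check only the core keys."""
    doc = json.loads(line)
    missing = [key for key in CORE_KEYS if key not in doc]
    if missing:
        raise ValueError(f"missing core keys: {missing}")
    # survey_fields is deliberately left unvalidated: it may differ by program.
    doc.setdefault("survey_fields", {})
    return doc


def get_answer(doc: dict, question: str, default=None):
    """Look up a survey-specific answer at read time; absent questions are fine."""
    return doc["survey_fields"].get(question, default)


# Two hypothetical records from different survey programs (values invented).
record_a = json.dumps({
    "record": 8695, "uuid": "zzcxgtz2m0ahuzf2", "date": 1434475680000,
    "start_date": 1434475020000, "customer_id": "abd123",
    "survey_fields": {"Q1_NPS": "10", "verbatim1": "Love Linkedin!"},
})
record_b = json.dumps({
    "record": 8696, "uuid": "hypothetical-uuid", "date": 1434475690000,
    "start_date": 1434475030000, "customer_id": "xyz789",
    "survey_fields": {"Q1_NPS": "7", "Q9_NewQuestion": "4"},
})

docs = [read_survey_record(r) for r in (record_a, record_b)]
nps_answers = [get_answer(d, "Q1_NPS") for d in docs]
new_question = [get_answer(d, "Q9_NewQuestion") for d in docs]
```

Storage never needs to know which questions exist; only the reader that wants a particular question asks for it, so adding or dropping questions does not force a table redesign.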
45. Example PIG script to read from HDFS
survey_raw = LOAD '/data/external/survey_vendor/survey_program1/
survey_step1 = FILTER survey_raw BY survey_fields#'status' == '3';
-- Flatten each kept survey document into a fixed set of typed columns
survey_step2 = FOREACH survey_step1 GENERATE
(chararray) 'survey_program1' AS survey_program_id,
(chararray) uuid AS unique_response_id,
(chararray) id AS member_id,
(int) survey_fields#'vwave' AS wave_field,
(int) survey_fields#'Q1_NPS' AS nps_value,
(chararray) survey_fields#'verbatim1' AS reason,
(int) survey_fields#'Q6_Drivers1',
(int) survey_fields#'Q6_Drivers2',
(int) survey_fields#'Q6_Drivers3',
(int) survey_fields#'Q7_Product_csat1',
(int) survey_fields#'V7_Product_csat2',
(int) survey_fields#'V7_Product_csat3',
(int) additionalinfo#'mobileos';
-- Write the flattened rows as tab-delimited text
STORE survey_step2 INTO 'survey_nps' USING PigStorage('\t');
Upload
To Teradata
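For readers who do not use Pig, the same filter-and-flatten step might look roughly like this in Python. It is a sketch, not the team's pipeline: the input lines, the column list, and the names `flatten` and `to_tsv` are hypothetical, and it assumes (as the Pig filter suggests) that only records with status 3 are kept.

```python
import csv
import io
import json

# Hypothetical raw records, one JSON document per line (values invented).
RAW_LINES = [
    json.dumps({"uuid": "r1", "customer_id": "abd123",
                "survey_fields": {"Q1_NPS": "10", "wave": 1, "status": 3,
                                  "verbatim1": "Love Linkedin!"}}),
    json.dumps({"uuid": "r2", "customer_id": "xyz789",
                "survey_fields": {"Q1_NPS": "6", "wave": 1, "status": 1}}),
]


def flatten(doc: dict, program_id: str) -> list:
    """Project one survey document onto a fixed list of columns."""
    fields = doc.get("survey_fields", {})
    return [
        program_id,
        doc["uuid"],
        doc["customer_id"],
        int(fields["wave"]),
        int(fields["Q1_NPS"]),
        fields.get("verbatim1", ""),
    ]


def to_tsv(lines, program_id: str) -> str:
    """Keep records whose status is 3 and write them tab-separated."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in lines:
        doc = json.loads(line)
        if str(doc.get("survey_fields", {}).get("status")) != "3":
            continue  # skip records that do not have status 3
        writer.writerow(flatten(doc, program_id))
    return out.getvalue()


tsv = to_tsv(RAW_LINES, "survey_program1")
```

The output is one narrow, fixed-width table regardless of how many other fields each survey program carries, which is what makes the later load into a single relational table cheap.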
46. Why is all this important? Because..
The Power is in the SQL JOIN
(and letting others join too)
select NPS_value, behavior1, behavior2
from nps_data a
inner join behavior1_data b
on a.customer_id = b.customer_id
inner join behavior2_data c
on a.customer_id = c.customer_id
NPS Data
Behavior 1 Data
Behavior 2 Data
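The join on this slide can be tried end to end with Python's built-in `sqlite3` module. This is a sketch with invented rows; the table and column names follow the slide, but the column types and the data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nps_data       (customer_id TEXT, NPS_value INTEGER);
    CREATE TABLE behavior1_data (customer_id TEXT, behavior1 INTEGER);
    CREATE TABLE behavior2_data (customer_id TEXT, behavior2 INTEGER);
""")
# Invented rows; customer c3 answered the survey but has no behavior2 row.
conn.executemany("INSERT INTO nps_data VALUES (?, ?)",
                 [("c1", 10), ("c2", 6), ("c3", 9)])
conn.executemany("INSERT INTO behavior1_data VALUES (?, ?)",
                 [("c1", 42), ("c2", 3), ("c3", 17)])
conn.executemany("INSERT INTO behavior2_data VALUES (?, ?)",
                 [("c1", 5), ("c2", 0)])

rows = conn.execute("""
    SELECT a.customer_id, a.NPS_value, b.behavior1, c.behavior2
    FROM nps_data a
    INNER JOIN behavior1_data b ON a.customer_id = b.customer_id
    INNER JOIN behavior2_data c ON a.customer_id = c.customer_id
    ORDER BY a.customer_id
""").fetchall()
conn.close()
```

Note that the inner joins drop `c3`, who has no row in `behavior2_data`; a `LEFT JOIN` would keep every survey respondent and leave the missing behavior as NULL.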
47. • What’s the NPS for each of our
member audience segments?
• What’s the NPS of members who
received our recent marketing
campaign and took action on it?
• What’s the NPS of software engineers
who have at least 5 skills, each with
more than 10 endorsements on their
profile?
Connect Stay Informed Get Hired
The JOIN allows us to answer questions in context of
business needs and customer experience
• What’s the satisfaction with our new
messaging tool for members who had
it enabled?
• What’s the NPS by region for
members who have purchased our
premium subscriptions?
• What’s the CRM record for B2B
customers who took our NPS survey?
• Which members scored highly on both
our member survey and our Talent
solutions survey?
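As an illustration of the first question, NPS per member segment can be computed once survey scores are joined to segment labels. The sketch below uses the standard NPS definition (percentage of promoters scoring 9-10 minus percentage of detractors scoring 0-6); the rows and the helper names `nps` and `nps_by_segment` are invented for the example.

```python
from collections import defaultdict


def nps(scores) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def nps_by_segment(rows) -> dict:
    """Group (segment, score) pairs and compute NPS within each segment."""
    by_segment = defaultdict(list)
    for segment, score in rows:
        by_segment[segment].append(score)
    return {segment: nps(scores) for segment, scores in by_segment.items()}


# Invented (segment, NPS answer) pairs, as a join of survey and profile data
# might produce them.
joined_rows = [
    ("students", 10), ("students", 9), ("students", 7), ("students", 3),
    ("senior leaders", 9), ("senior leaders", 8), ("senior leaders", 6),
    ("senior leaders", 5),
]
result = nps_by_segment(joined_rows)
```

Swapping the segment column for any other joined attribute (region, campaign exposure, subscription status) answers the other questions on the slide with the same function.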
51. Big Data Trends 2014
1. Uploadable, findable, shareable, real-time data
2. Sensors use rising rapidly.
3. Processing costs falling rapidly, while cloud rises
4. Beautiful new user interfaces, aided by data-generating
consumers – helping make data usable/useful
5. Data mining / analytics tools improving & helping
find patterns
6. Early emergence of data/pattern driven problem
solving
52. Data Mining or
Machine Learning Outcomes
1. Rank or prioritize a customer or prospect list
2. Replace or move assets or resources
3. Classify or segment
4. Rank drivers of a key metric
5. Categorize text
6. Generate a lift for a key metric
Why not: NPS, Promoters, CSAT ?
53. Data Mining Techniques
Commonly Used by the
Business Analytics Team on Market
Research & other Marketing data
• Decision Trees & Random Forest
• Generalized Boosted Models (GBM)
• Logistic Regression
• Stochastic Gradient Descent(SGD)
• Clustering
• Bayesian Networks
• Text Classification & Mining (LDA, NLP)
56. Tools for Provoking & Taking Action
1. Always-available NPS and CSAT Dashboards for anyone,
for any product line
2. Drill down analysis
3. Emotional driver prioritization
4. Product driver prioritization
5. Open ends or verbatims
6. Composition & waterfall analysis for studying changes
7. Deep pattern analysis and focus
57. The Big Picture on Why Big Data Matters to Market Research
Business
Knowledge
Market
Research
58. The Big Picture on Why Big Data Matters to Market Research
Customers
Product
Market Research
59. The Big Picture on Why Big Data Matters to Market Research
Moore’s Law
60. We are hiring!
Linkedin Job Search on:
Linkedin Business Analytics
Market Research
Transform yourself
Transform the company
Transform the world
Our vision is to create economic opportunity
for every member of the global workforce.
Thank you from
Al Nevarez
Sally Sadosky
Editor's Notes
Facebook not only “regularly polls its members about their Facebook experience” but also has created a Facebook Feedback Panel to
harvest the type of longitudinal research that many skeptics have already claimed as dead.
Despite its massive amount of user data, clearly Facebook sees the value in directly surveying its members.
May drop the vision slides and just go to the Mission slides to shorten this section and get to the Member more quickly.
Mission: Connect the world’s professionals to make them more productive and successful
HQ in Mountain View, CA, with offices in 30 cities around the world
Linkedin is available in 23 languages
Linkedin has more than 6,000 full time employees
Over $2bil revenue last year
A set of circumstances that makes it possible to do something
A chance for employment or promotion
Ecosystem. Members feed the business, the business feeds the Members. Market research and Analytics play a role in helping to build and strengthen those bridges
Need to ask Al what the graphic means with google amazon, etc
Now LinkedIn is the world biggest professional social network.
7 member segments: students, career starters, career builders, senior leaders, small business owners,
- Not every member is our most active member. We look at people who are active. People who are not as active have as much to say as most active. We can learn from everyone.
LinkedIn's mission is to connect the world's professionals to make them more productive and successful. The most important word in this mission is "professionals". How do we achieve this mission? We focus on the following 3 areas.
Professional identity, professional networks and knowledge
In terms of Professional identity
We want to have an up-to-date professional record to represent our experience, skills and perhaps most importantly our ambitions.
Professional networks, it is about connecting all world professionals. Our network connections help us to find career opportunities, business opportunities. We can keep in touch, or get back in touch with our old classmates, co-workers.
Knowledge,
Members leverage LinkedIn to express and exchange knowledge, as a professional publish platform.
Hire, Market, Sell
For our enterprise customers, we focus on hiring, marketing and sales.
Hire: help enterprises find and attract great talent, and target the right person with the right job.
Market: engage members with relevant and meaningful content at scale.
Sell: find and engage buyers, and use your company's connections to get warm introductions.
Call out the 5 principles of building great products….also say this is what drives research….particularly exceeding expectations and creating an experience that emotionally resonates with our Members
At LI we follow a very traditional approach to both Market Research and User Experience Research.
Using both qualitative and quantitative techniques we
Identify opportunities in the market place
We design and build a product or set of products that meet the members or customers needs
We develop a go to market strategy
We measure our success through NPS and member follow up research
One of the huge benefits at LI is that we have attitudinal data through my surveys as well as the behavioral data that we collect as members use the site. Having this additional massive set of behavior metrics has fundamentally changed how I think about research in many ways.
But for today, I want to focus on just one area in this cycle The Post Launch Tracking and Evaluation.
Our CEO Jeff Weiner comes from Yahoo where NPS was used as the metric to measure success and loyalty. Jeff is a huge fan of NPS and believes that the higher bar of 9 and 10 being promoters versus 0-6 being detractors helps to focus product teams on his 5 operating principles.
Over the past 2 years of setting up the market research department at LI, the one key learning is that Loyalty and Satisfaction of LI has as much to do with the heart as it does with the mind or the product. This is where the analysis that Al and I have been working on really gets interesting
In marketing, we have the dual challenge of reaching the hearts and minds of our members so they take action, so they engage with Linkedin. This became evident when we started to collect NPS data and verbatims of why they gave us the Likelihood to Recommend rating that they did.
You can think of the mind as the analytical, the measuring side of our conscious.
(click to appear) We have some tools to help our staff measure our member’s engagement & loyalty (go/nps), to read about their interactions with our products (go/voices),
and more tools to come this year from business analytics and market research to help us to discover the patterns which cause these metrics to move.
But unless you’re a product manager, or doing our marketing team’s work to understand our members, why should you care?
How can we provoke the broader staff at Linkedin to know, to feel, and to do more for our members?
How about if we broke free of the thinking side, and just went for the heart.
We came up with one idea, called go/memberfeelin, which you can see scrolling by over there….
AL:
Besides possibly generating some empathy & awareness, why is this important?
I like this tool called Johari’s window for understanding personal relationships.
And I think it can work for Linkedin in general.
There are things that you know about yourself, and some of those are known to others, some are unknown to others.
(click) We might classify these situations as Open and Hidden.
Let’s replace “Self” with “Linkedin”. (click) Same idea.
(click) Now let’s consider the things that are Unknown to Linkedin
When something is unknown to us but known to our members, we can consider these situations where we (Linkedin) are Blind.
Unknown to both is just unknown.
(click) We don’t want to be blind (highlight box).
Here is go/memberfeelin
Go over to the other screen (leave Johari’s window up on screen #1)
Demo
A user interface that highlights emotional and personal words stated by our members
These are from our daily Global NPS survey, answers to the “Why did you give Linkedin that NPS score” question
We found 19 terms which we felt would highlight very personal comments
It’s currently running in what we called automatic mode, and it cycles through 5 comments for each word then goes onto the next word.
You can also click any particular word or term on the left, and we’ll focus on that.
We think the feature of anchoring the verbatim on the emotional word, helps to read and get to the heart of the member’s comment.
On the right, we indicate the cm segment, their NPS rating level, and when they stated the comment. This goes back through all of 2014.
Over the past 2 years of setting up the market research department at LI, the one key learning is that Loyalty and Satisfaction of LI has as much to do with the heart as it does with the mind or the product. This is where the analysis that Al and I have been working on really gets interesting
In marketing, we have the dual challenge of reaching the hearts and minds of our members so they take action, so they engage with Linkedin. This became evident when we started to collect NPS data and verbatims of why they gave us the Likelihood to Recommend rating that they did.
We found product improvements for sure….every company does, but we were finding a really strong predictive model based on survey data or behavioral data.
This is when Al and I really started to combine forces to understand loyalty based on behaviors as well as emotional impact.
For LinkedIn’s member NPS (we have a similar but customized process for each paid product): We do the following…go through the slide
Set up the questionnaire design
About 8 minutes to complete
76% give us some input for the open ends…either one or both
Over 80% agree to be re-contacted….
Follow ups:
We will filter some to our customer service if they indicate a huge problem;
We send regions their own data and have the marketing teams work in the local language
We have a member call program from the recontacts
We develop additional surveys as necessary
Market research and big data. What’s the big deal?
Though we enjoy reading these interesting articles, we don’t believe any of them.
At Linkedin, we recognize it’s crucial to work together.
Why’s that? Well, at Linkedin, we have the privilege of having lots of data, coupled with high standards for the privacy of our member data.
As a result, we build internal tools that can handle the scale, the social graph, the economic graph with all its companies, employees, skills, jobs, education, and knowledge
So you can imagine that big data tools that combine this abundance of data are quite welcome
400million registered users
about 100 mil of them visiting every quarter
Nearly 40 billion page views in the last quarter
Linkedin doesn’t make any physical products. Our value is derived purely from information.
So I think market research and big data can work together.
At a place Like Linkedin, there’s a healthy cooperation regarding tools and resources and analytics horsepower
Why.. All this data is a reflection of who we are
A new Ecosystem has been created. Storage computing power, web frameworks, and social network
We all strive to understand our customers better. The opportunity has never been greater.
So how do we gear ourselves up to do this
We live in a world where innovation, new companies, new projects are driven by customer needs.
Besides the designers at Linkedin, we have a number of analysis roles to support this modern driver.
At Linkedin, we’re serious about the business in business analytics.
So much so, that we actually report to the CFO.
My department supports all the major product lines at Linkedin.
Helping sales, marketing, and operations in each to be more efficient. To help us know more
And to help us know our members and clients more.
But we don’t just do adhoc, 1 time analysis and deliver a deck…
Why is our department critical?
Back in the 70s there was a study to find the most efficient living thing in the world.
A study of the energy needed to get from point A to point B.
The condor won. It Took the least amount of energy to get from A to B
Man didn't do so well, unimpressive at about a 1/3 of the way down the list.
But someone doing the research was insightful enough to test Man on a Bicycle
MwB won. twice as good as the condor.
Tools
But even with tools.. Still need the doctor
Our philosophy and steps for going from data to transformation
The yellow elephant has come to symbolize big data. arising from the Hadoop creator’s son’s toy elephant
Where does Big Data come in. Big data has many facets. It’s also a new spirit of data management and analysis.
The yellow elephant has become a symbol of this movement, Is it hype?
"There is a need to bring market research data to Hadoop and Hadoop to market research data.”
Here are some technical themes regarding big data
Does this sound like hype? I don’t know. It seems quite useful to me.
Map reduce is efficient
Instead of moving the data to the computation, it moves the computation to the data.
What are the themes of big data?
1/10th the price of traditional data
For the same price you can store 10x as much
Deploy to multiple processors with little cost.
What is amazing about this is that it scales horizontally. If we double the number of machines, then (ignoring certain fixed-costs of running a MapReduce system) our computation should run approximately twice as fast. Each mapper machine will only need to do half as much work, and (assuming there are enough distinct keys to further distribute the reducer work) the same is true for the reducer machines.
3 areas
The first 2 are the true magic here
Hadoop is a generic processing framework designed to execute queries and other batch read operations against massive datasets that can be tens or hundreds of terabytes and even petabytes in size. The data is loaded into or appended to the Hadoop Distributed File System (HDFS). Hadoop then performs brute force scans through the data to produce results that are output into other files.
It enables applications to work with thousands of computational independent computers and petabytes of data.
ACID: atomic, consistent, isolated, durable
I found it interesting how Facebook's blended the use of both traditional SQL data stores such as Oracle and MySQL and NoSQL solutions such as Hive as part of their overall solution.
Is an entire ecosystem of integrated distributed computing tools, at the core of which are a file system (HDFS) and a programming framework (Map-Reduce).
Big data comes with many new products. And they have fun names.
Don’t get overwhelmed. Be thankful actually.
It’s a scramble for innovation. A scramble to make it easier.
We only use a few of these tools, Hadoop, Hive, Pig, and get tremendous value from them.
Like a Data level operating system <<
Hadoop operates on massive datasets by horizontally scaling the processing across very large numbers of servers through an approach called MapReduce.
Think about all the math in which you use a summation. What if all the things you were summing couldn’t fit on one machine?
Or what if you could do chunks of the summation all in parallel, then bring the mini-results together later for a final result? That’s map reduce.
Running on the magic of HDFS, it brings the processing to the data, rather than the data to a single processor.
Hundreds or thousands of small, inexpensive, commodity servers all execute in parallel. Using the MapReduce approach, Hadoop splits up a problem, sends the sub-problems to different servers, and lets each server solve its sub-problem in parallel. It then merges all the sub-problem solutions together and writes out the solution into files which may in turn be used as inputs into additional MapReduce steps.
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
Example.
If a number uses 7 bytes:
100 million numbers: 700,000,000 / 1024^3 ≈ 0.65 gigabytes
1 billion numbers: 7,000,000,000 / 1024^3 ≈ 6.5 gigabytes (my laptop has 16 GB of RAM)
10 billion numbers: ≈ 65 gigabytes
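The arithmetic in this example can be checked in a few lines of Python. The 7 bytes per number and the 16 GB laptop come from the note; the constant and function names are hypothetical.

```python
BYTES_PER_NUMBER = 7          # assumption taken from the note
BYTES_PER_GIB = 1024 ** 3


def size_in_gib(count: int, bytes_per_number: int = BYTES_PER_NUMBER) -> float:
    """Raw size, in GiB, of `count` numbers stored at a fixed width."""
    return count * bytes_per_number / BYTES_PER_GIB


one_billion = size_in_gib(1_000_000_000)
ten_billion = size_in_gib(10_000_000_000)

# Does each data set fit in a 16 GiB laptop's memory?
fits_in_16_gib = {
    count: size_in_gib(count) <= 16
    for count in (1_000_000_000, 10_000_000_000)
}
```

One billion numbers fit in memory on a single laptop; ten billion do not, which is the point at which spreading the work over several machines starts to matter.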
Extract, transform, load
Here’s the bottom line on why this matters
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed by different nodes in the cluster. In addition to this each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures which can result in partial storage. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.
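A toy Python model of the chunk-and-replicate idea described here. It is not how HDFS actually places blocks: the round-robin placement, the tiny block size, and the names `split_into_blocks`, `place_replicas` and `available_blocks` are all assumptions made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(n_blocks: int, nodes: list, replication: int) -> dict:
    """Round-robin placement: block i goes to `replication` consecutive nodes."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(n_blocks)
    }


def available_blocks(placement: dict, failed_node: str) -> set:
    """Blocks that still have at least one replica on a healthy node."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


blocks = split_into_blocks(b"x" * 1000, block_size=256)   # 4 blocks
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes, replication=3)
survivors = available_blocks(placement, failed_node="node2")
```

With three replicas on distinct nodes, losing any single node leaves every block readable, which is the durability property the paragraph describes.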
Linkedin, like Facebook and others, blends the use of both traditional SQL data stores such as Oracle & Teradata and Hadoop-based solutions such as Hive as part of its overall solution.
The big data storage is essential for our business, which is logging sign ins, profile updates, searches, etc.
These are all evidence of engagement, which we use in our sampling strategy
For example:
If we want to plot or data mine NPS broken out by our member segments, we can join the snapshot data with NPS to have the actual segment at the time each member took the survey. Not the new segment the member may have moved into since.
Data warehouse IT time is expensive.
Setting up a new survey program requires creating new database tables, and designing the schema for each table
Changes to the data structure
From Vendor to storage is now very low cost, including automation from your DWH team
Assuming the vendor has an API, passing data to Hadoop is quite inexpensive at this point.
And if survey design changes for any of A to E, no problem: Hadoop’s unstructured data storage handles it fine
No schema to redesign
And a 1 time setup of a process to ETL from hadoop to database table. It’s generally immune to changes in survey design, e.g. new fields, etc.
Your DWH team will love it. And you can do it within budget.
The power is in the join
The join is powerful.
Here are some examples of the sort of questions we can answer easily because
We have the data in our database, linked with marketing data, with our CRM data, etc.
Allows me to ask questions like:
I need a list of our power users/lovers for an upsell campaign:
Who are the members who scored highly on our member NPS survey, and also score high on our Sales solutions survey
And have added 10+ connections in the past week, and have marked at least 10 leads in the past week.
This is one example of a Tableau dashboard we created, for helping any part of our business monitor their NPS score
Trends, and verbatims
The beauty behind this is that despite the multiple survey program and survey source, we condensed all this
Data into 1 Teradata table, and make it quite easy for a tool like tableau to handle.
Each year, Mary Meeker, of the VC firm Kleiner Perkins Caufield Byers, publishes a comprehensive and always interesting 100+ page deck on Internet Trends.
Meeker’s 2014 edition has a section on Big Data where she lists the following six trends:
1. Uploadable, findable, shareable, real-time data
2. Sensors use rising rapidly.
3. Processing costs falling rapidly, while cloud rises
4. Beautiful new user interfaces, aided by data-generating consumers – helping make data usable/useful
5. Data mining / analytics tools improving & helping find patterns
6. Early emergence of data/pattern driven problem solving
Sure, I’ll have some of that.
Traditional quad chart
Importance calculated via:
Correlation analysis, partial correlation analysis, bayesian networks with sensitivity analysis
Analytics have little power until they inform a decision
To conclude
“In a world where marketing is increasingly about listening to your customers and meeting their needs, you need to find a way to, both, do that well and do it efficiently”
And to wrap up, on a high level as to why big data matters for market research
Remember this venn diagram I showed regarding the ideal data scientist earlier.
Well It’s the same image for the modern market researcher.
Let’s keep our role and our skills in perspective
But that’s not all
I argue that Moore’s law is why all of this matters.
We continue to do more and more with microprocessors. Big data technologies like Hadoop and its HDFS
have created a step change in what’s possible for our products and our customers, for our market research.
Couple this with the rapidly falling price of storage
As market researchers, we can help our companies realize new opportunities, even new business models.
Moore: the number of transistors in a dense integrated circuit has doubled approximately every two years.
The period is often quoted as 18 months because of Intel executive David House, who predicted that chip performance would double every 18 months (being a combination of the effect of more transistors and their being faster)