TIPS from the Experts
Table of Contents 
Setup is Key 
Think wide 
Tool integration 
Evaluate and Adapt 
Sharing 
Encryption 
A data science mindset 
Innovation 
Real-time action
Grant Unlimited Access 
Create a data lake and give your business and 
data analysts access to all your data – 
structured and unstructured – with SQL engines 
like Hive. They will surprise you with the insight 
and value they can extract, and your 
development team will have less work 
answering ad-hoc queries. 
—Christian Prokopp, Principal Consultant at Big Data Partnership
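The tip above can be sketched in a few lines. This uses the stdlib's sqlite3 as a local stand-in for a SQL-on-Hadoop engine such as Hive; the table, columns, and data are illustrative, not from the original deck.

```python
import sqlite3

# In-memory SQLite stands in for a SQL engine over a data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "purchase", 30.0), (1, "view", 0.0), (2, "purchase", 12.5)],
)

# An analyst's ad-hoc question, answered with no new development work:
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "WHERE action = 'purchase' GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 30.0), (2, 12.5)]
```

The point of the tip is exactly this shape of query: once the data is queryable with SQL, analysts can answer their own questions.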
Select the Right Tools 
A common question is when to use the
MapReduce/Pig/Hive frameworks versus
HBase/Cassandra/Impala. Non-functional
requirements (NFRs) should drive the decision.
MapReduce/Pig/Hive suit high-throughput,
high-latency workloads, as in the case of batch
processing and ETL. HBase/Cassandra/Impala
suit low-latency workloads, as in the case of a
customer filling out an online application.
—Praveen Sripati, Hadoop trainer and author of Hadoop Tips
Use Presto
Improve query performance by considering
Presto with the RCFile or ORC file format.
—Minesh Patel, Qubole
Incorporate Machine Learning
Use robust machine learning algorithms to
extract value from the data. Data collection and
massive storage are only the enabling
infrastructure. You should leverage existing and
proprietary machine learning algorithms that
discover hidden patterns and learn from the
data what is important for the analyst to view
and examine, and what is not.
—Idan Tendler, CEO of Fortscale
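A minimal sketch of "learning from the data what matters": a z-score flags the unusual account so the analyst looks there first. The dataset, threshold, and user names are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical login counts per user per day.
logins = {"alice": 4, "bob": 5, "carol": 6, "mallory": 42, "dave": 5}

values = list(logins.values())
mu, sigma = mean(values), stdev(values)

# Flag users whose activity sits well above the norm (threshold is
# an illustrative choice, not a recommendation).
flagged = [u for u, n in logins.items() if (n - mu) / sigma > 1.5]
print(flagged)  # ['mallory']
```

Real systems use far richer models, but the shape is the same: the algorithm surfaces the hidden pattern; the analyst only reviews what is flagged.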
Automation is Key 
There is a big need for automation in Big Data.
The security industry has proven the value of
Big Data, but it has just as quickly proven that
Big Data is valueless without automation
wrapped around it to make it practical. Only
once you make Big Data practical can you begin
to perform the analytics where the value of Big
Data in the security industry really gets
unlocked.
—Sean Brady, VP of Product Management at Vorstack
Identify Easy Wins 
Segment the data based on demographic 
and/or firmographic information. This is an easy 
and inexpensive way to highlight trends in the 
primary customers and industries served. This 
information is very helpful when determining 
what new products and/or services should be 
offered. In addition, look for trends in 
behavioral transaction information and further 
optimize the customer’s experience with 
relevant marketing and messaging. 
—David Handmaker, CEO of Next Day Flyers
Think Broad 
Identify all of the data you have access to and/or will 
produce, and explore possible audiences and use 
cases for it. Often times, big data plays are geared 
toward a fairly narrow audience and set of use 
cases based on the original inspiration for the 
solution. Or, there is not an active and explicit 
exploration of the full potential of what you have to 
offer. I can all but assure you that there are major 
opportunities for your offering that you haven’t 
even considered yet. The earlier you have a crisp 
view of the potential of your big data and offering, 
the better able you will be to build the right thing, in 
the right way, to exploit the potential of that idea. 
—Dirk Knemeyer, founder of Involution Studios
Setup is Key
Careful and Smart Integration 
with BI tools 
Big Data tools (MapReduce, Hive, etc.) are known
for their latency problems, but on the other hand
they are excellent for processing petabytes of data
in a distributed computing environment. When
integrating with any BI/reporting tools, big data
technologies should be used in a manner that
avoids their weaknesses and leverages their
strengths.
For example, if you are building an integrated
pipeline with BI tools, aggregate as much as you
can and use the caching or cube technologies of
the BI tools to make the experience faster for the
end user. Real-time connectivity to big data
sources like Hive/HDFS makes for a poor end-user
experience in the BI space, so it should be avoided.
—Ashish Dubey, Solutions Architect at Qubole
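The aggregate-then-cache pattern above can be sketched with stdlib tools: pre-aggregate once, then serve repeated dashboard hits from a cache instead of re-scanning the source. The fact rows and field names are illustrative.

```python
from functools import lru_cache

# Raw fact rows, as they might land in the lake: (day, region, amount).
raw_events = [
    ("2024-01-01", "US", 120.0),
    ("2024-01-01", "EU", 80.0),
    ("2024-01-02", "US", 95.0),
]

@lru_cache(maxsize=None)
def daily_totals():
    """Aggregate once; later calls are cache hits, not slow
    Hive/HDFS round trips."""
    totals = {}
    for day, region, amount in raw_events:
        totals[day] = totals.get(day, 0.0) + amount
    return tuple(sorted(totals.items()))

print(daily_totals())  # computed on first call
print(daily_totals())  # served from cache
```

A BI tool's own cube/cache layer plays the role of `lru_cache` here; the design choice is the same: push the expensive scan out of the interactive path.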
Invest in Your Pipeline 
As a rule of thumb, invest 80% of your time in
your data lake and data pipeline (mining,
extracting, cleaning, transforming, loading), and
20% in the high-level data science and machine
learning effort. Data in the wild is complex,
wrong, contradictory, and hard to access and
find. Consequently, more, faster, and more
accurate data usually has a higher impact than
more complex models, and makes for a robust
system.
—Christian Prokopp, Principal Consultant at Big Data Partnership
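A toy example of the "80%" work: a cleaning pass over messy records. The field names and validation rules are illustrative, not from the original tip.

```python
# Messy input, as data in the wild tends to arrive.
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": "n/a"},  # unparseable value: dropped
    {"name": "", "age": "28"},      # missing name: dropped
]

def clean(record):
    """Trim, type-convert, and reject records that fail validation."""
    name = record["name"].strip()
    try:
        age = int(record["age"])
    except ValueError:
        return None
    return {"name": name, "age": age} if name else None

cleaned = [r for r in map(clean, raw) if r is not None]
print(cleaned)  # [{'name': 'Alice', 'age': 34}]
```

Unglamorous, but as the tip says: correct, accessible data like this usually moves the needle more than a fancier model downstream.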
Don’t Rush Into Analysis 
Everyone with a Big Data project wants to rush 
straight into analysis. That is where things 
usually fall apart, however, because there is 
simply too much data flowing across the 
network and it is mostly in a format that 
current analytics software cannot handle. 
—Rick Aguirre, president of Cirries Technologies
Start with Heavy Lifting
Big Data success requires three steps of heavy lifting first,
before you ever analyze it.
Step 1 is data capture.
Most of the Big Data torrent is a big nothing and not relevant.
Decide what data you want to analyze and set up algorithms to
locate and corral it.
Step 2 is data control.
You want to capture the data you need as it comes
across the network. It may not be relevant in just a few minutes,
or you may need to store it for a number of years if, as one
example, it is data that might be needed later for law
enforcement purposes.
Step 3 is data humanization.
This is where you convert whatever format the data is in to a
format that your analytics software can use. Only now, at this
step, do you have the right data in the right format that you can
then use for whatever kind of analytics you have in mind.
—Rick Aguirre, president of Cirries Technologies
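The three steps above can be sketched on a synthetic stream. All record shapes, retention values, and rules here are hypothetical stand-ins.

```python
import json

# Synthetic network records arriving as a stream.
stream = [
    {"src": "10.0.0.1", "kind": "heartbeat", "ts": 100},
    {"src": "10.0.0.2", "kind": "login_fail", "ts": 101},
    {"src": "10.0.0.3", "kind": "login_fail", "ts": 102},
]

# Step 1, capture: keep only the records worth analyzing.
captured = [r for r in stream if r["kind"] != "heartbeat"]

# Step 2, control: attach a retention policy as data is captured
# (e.g. security-relevant events kept longer for investigations).
RETENTION_DAYS = {"login_fail": 365}
for r in captured:
    r["retain_days"] = RETENTION_DAYS.get(r["kind"], 7)

# Step 3, humanization: convert to the format the analytics
# software expects (one JSON line per event, here).
humanized = [json.dumps(r, sort_keys=True) for r in captured]
print(humanized[0])
```

Only after all three steps does analysis begin, which is the author's point: the heavy lifting comes first.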
Think wide
Once data is collected, you have easy access
for advanced analytics. Don’t stop at analyzing
only one log source or one dimension of data;
analyze across log sources and multiple
entities. For example, to discover advanced
cyber attacks that leverage users’ credentials,
we profile users’ behavioral activity, including
their permissions configuration, their access to
files and systems, and their web activity. We
analyze their historical activity as well as
comparing them against their peers.
—Idan Tendler, CEO of Fortscale
Use the ODBC Driver 
Perform BI Analytics and Visualization 
with the ODBC Driver. 
—Minesh Patel, Qubole
Use a Subsample 
I always start by looking at a subsample of the 
data. You often get a very good impression of 
what the main focus of the data munging or 
cleaning will be just by looking at some 
numbers (or characters). 
—Benedikt Koehler, Data Scientist and Blogger at Beautiful Data
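The subsample-first habit is one line of stdlib code. The dataset and sample size below are illustrative.

```python
import random

random.seed(7)  # reproducible peek

# Stand-in for a large dataset you don't want to scan yet.
data = [f"row-{i}" for i in range(1_000_000)]

# Pull a handful of rows and just look at them: what's dirty, what
# needs parsing, where the munging effort will go.
sample = random.sample(data, 5)
for row in sample:
    print(row)
```

With real data, this quick look usually reveals the encoding quirks, missing values, and odd formats that will dominate the cleaning work.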
Evaluate and Adapt
Measure Everything 
Measure and record everything, and keep an 
eye on your key metrics. Things change, and 
tests become obsolete, and sometimes in 
surprising ways especially when you depend 
on external data. For example, data sources 
you mine may introduce rolling changes, which 
are hard to catch as an error but easy to 
identify in metrics. 
—Christian Prokopp, Principal Consultant at Big Data Partnership
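A minimal sketch of catching a "rolling change" in a metric rather than as an error: compare today's value against a recent baseline. The metric, window, and 10% threshold are illustrative choices.

```python
from statistics import mean

# A key metric recorded per run, e.g. rows ingested per day.
history = [1000, 1010, 990, 1005, 998]
today = 750  # an upstream source quietly changed

baseline = mean(history[-5:])
drifted = abs(today - baseline) / baseline > 0.10  # >10% deviation
print("investigate the source!" if drifted else "ok")
```

The silent upstream change never raises an exception, but the recorded metric makes it obvious, which is exactly the tip's point.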
Sharing is Caring 
Measure and record everything, and keep an 
eye on your key metrics. Things change, and 
tests become obsolete, and sometimes in 
surprising ways especially when you depend 
on external data. For example, data sources 
you mine may introduce rolling changes, which 
are hard to catch as an error but easy to 
identify in metrics. 
—Idan Tendler, CEO of Fortscale
Encryption
Encrypting data at rest is a best practice.
—Minesh Patel, Qubole
Pick the Right Distribution 
A common question is whether to go with a
distribution from Apache or from a vendor. If
there is enough expertise in the organization to
know the internals of the different frameworks,
integrate them, and resolve any issues quickly,
then go with the Apache distribution. If that
expertise is not available, use a distribution
from a vendor and get commercial support to
resolve any issues that arise.
—Praveen Sripati, Hadoop trainer and author of Dattamsha
Start Small
Developing a Big Data strategy is all about
starting small and making gradual steps toward
becoming more data-driven. Start by breaking
down the data silos within your organization to
gain the most insight from your data when you
start analyzing it through a variety of tools.
—Mark van Rijmenam, CEO / Founder of BigData-Startups
Have a Business Intent 
There is often a perception that there is gold in 
an organization’s data, and that if you just look 
hard enough, you will find it. In reality, this 
perception can lead to fruitless efforts with no 
real direction and no payoff. Instead, start with 
a business intent in mind. What are the actions 
you would take—and the value to your 
business—if data can provide the answer to a 
certain question? 
—Sean Stauth, Director, Client Services, Silicon Valley Data Science
Update Your Strategy 
Your data strategy should be a living document 
that helps you get the most value from your 
data. As your goals, your technical environment, 
or the market change, keep it updated to help 
you follow those changes and stay on course. 
—Scott Kurth, VP, Advisory Services, Silicon Valley Data Science
A Data Science Mindset
Data Science Mindset 
Have an always-on data science mindset — 
Successful big data initiatives start with a holistic 
360-degree view of the problem space. This includes
understanding the inputs (data types, sources, 
features), the desired outputs (decisions, goals, 
predictions), and the constraints (model 
parameters, boundary conditions, optimization 
constraints). To achieve this perspective, one must 
be thinking like a scientist from start to finish: 
collect data, infer a testable hypothesis, design an 
experiment, test and evaluate the results, refine 
your hypothesis, and repeat (if necessary). 
—Kirk Borne, Data Scientist, Astrophysicist and Big Data Science Consultant
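One turn of the collect-hypothesize-test-refine loop described above, on a toy dataset. The observations, hypothesis, and refinement rule are all illustrative.

```python
# Collected data: sensor readings, one of which looks anomalous.
observations = [2.1, 1.9, 2.0, 2.2, 5.0, 2.1]

def evaluate(threshold):
    """Hypothesis under test: readings above `threshold` are anomalies."""
    return [x for x in observations if x > threshold]

# Design an experiment, test, and refine if the hypothesis finds nothing.
hypothesis = 4.0
anomalies = evaluate(hypothesis)
if not anomalies:          # refine the hypothesis and repeat
    hypothesis -= 0.5
    anomalies = evaluate(hypothesis)
print(anomalies)  # [5.0]
```

Real initiatives iterate this loop many times over models rather than a single threshold, but the scientist's cycle is the same.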
Return on Innovation 
The most important ROI in Big Data Analytics 
projects is Return On Innovation. What are you 
doing that’s different and consequential? What 
sets you apart from the rest of the multitudes in 
this space? 
—Kirk Borne, Data Scientist, Astrophysicist and Big Data Science Consultant
Focus on the Users 
Developing a big data platform requires focusing 
on the users. Serve a few users well, and let their 
processing scale up with your capabilities. 
“Premature platformization” or trying to satisfy too 
many use cases too early in the project leads to 
failures. Make the initial users successful, and the 
ecosystem will thrive and grow. 
—Owen O’Malley – Sr. Architect and Co-founder of Hortonworks
Use the API
Using the API: samples are available for the
Java SDK, Python SDK, and REST.
—Minesh Patel, Qubole
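A sketch of driving such a service over REST with only the stdlib. The endpoint, payload, and auth header below are hypothetical placeholders, not Qubole's actual API; the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical REST call: submit a query command to an API endpoint.
req = urllib.request.Request(
    "https://api.example.com/v1/commands",   # placeholder URL
    data=json.dumps({"query": "SELECT 1"}).encode(),
    headers={
        "Content-Type": "application/json",
        "X-AUTH-TOKEN": "<your-token>",       # placeholder credential
    },
    method="POST",
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted here.
```

The SDKs wrap exactly this kind of call; REST is the lowest common denominator when no SDK fits your stack.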
Take Real-Time Action 
If you cannot take real-time action, you have 
no need of real-time processing. There will 
always be batch processing workloads 
supporting the enterprise, and increasingly 
dynamic decision areas can be effectively 
supported by analytical systems because of 
advances in data architectures. 
—Sanjay Mathur, CEO, Silicon Valley Data Science
Store Denormalized State 
State—the full context of an event, like a 
customer visit or the completion of a step in 
a manufacturing process—can be expensive 
to reassemble after the fact. This is 
particularly true with highly relational 
systems: witness the complex ETL (extract, 
transform, load) workloads that enterprise 
data warehouse systems struggle to scale. 
Storing denormalized state, e.g., rich logs, for
analysis has proven highly successful for the
web businesses of Silicon Valley, and those
techniques can be applied to industries
across the economy.
—John Akred, CTO, Silicon Valley Data Science
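A "rich log" line of the kind the tip describes: the full context of one event, captured denormalized at write time so no multi-table join is needed later. Every field name and value here is illustrative.

```python
import json

# One customer visit, with user, page, and session context inlined
# rather than referenced by foreign keys.
event = {
    "event": "page_view",
    "ts": "2024-05-01T12:00:00Z",
    "user": {"id": 42, "plan": "pro", "signup_date": "2023-11-02"},
    "page": {"url": "/pricing", "experiment": "variant-b"},
    "session": {"id": "abc123", "referrer": "search"},
}

line = json.dumps(event, sort_keys=True)
print(line)  # one self-contained line, ready for later analysis
```

The trade-off is more storage per event in exchange for skipping the expensive ETL reassembly the tip warns about.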
Build a Common Platform 
Whether you are thinking about migrating
toward Big Data or just starting out with data
altogether, it helps to focus on building and
maintaining a common platform. Like software
development platforms, data platforms should
include source control, change management,
and testing scenarios. This will help reduce
future migration costs and will lead to
long-term sustainable, competitive data
capabilities.
—Ryan Kirk, Sr. Data Scientist at Hipcricket
Looking for additional big data tips and advice? 
Subscribe to Qubole's email newsletter. 
Sources: 
http://www.qubole.com/new-series-big-data-tips/ 
http://www.qubole.com/setup-is-key/ 
http://www.qubole.com/evaluate-and-adapt/ 
http://www.qubole.com/data-mindset/
