TIPS from the Experts
Table of Contents 
Setup is Key 
Think wide 
Tool integration 
Evaluate and Adapt 
Sharing 
Encryption 
A data science mindset 
Innovation 
Real-time action
Grant Unlimited Access 
Create a data lake and give your business and 
data analysts access to all your data – 
structured and unstructured – with SQL engines 
like Hive. They will surprise you with the insight 
and value they can extract, and your 
development team will have less work 
answering ad-hoc queries. 
—Christian Prokopp, Principal Consultant at Big Data Partnership
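The tip above can be sketched in a few lines. This uses the stdlib's sqlite3 as a local stand-in for a SQL-on-Hadoop engine such as Hive; the table, columns, and data are illustrative, not from the original deck.

```python
import sqlite3

# In-memory SQLite stands in for a SQL engine over a data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "purchase", 30.0), (1, "view", 0.0), (2, "purchase", 12.5)],
)

# An analyst's ad-hoc question, answered with no new development work:
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "WHERE action = 'purchase' GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 30.0), (2, 12.5)]
```

The point of the tip is exactly this shape of query: once the data is queryable with SQL, analysts can answer their own questions.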
Select the Right Tools 
A common question is when to use the
MapReduce/Pig/Hive frameworks versus
HBase/Cassandra/Impala. Non-functional
requirements (NFRs) should drive the decision.
MapReduce/Pig/Hive suit high-throughput,
high-latency workloads, as in the case of batch
processing and ETL. HBase/Cassandra/Impala
suit low-latency workloads, as in the case of a
customer filling out an online application.
—Praveen Sripati, Hadoop trainer and author of Hadoop Tips
Use Presto
Improve query performance by considering
Presto with the RCFile or ORC file format.
—Minesh Patel, Qubole
Incorporate Machine Learning
Use robust machine learning algorithms to
extract value from the data. Data collection and
massive storage are only the enabling
infrastructure. You should leverage existing and
proprietary machine learning algorithms that
discover hidden patterns and learn from the
data what is important for the analyst to view
and examine, and what is not.
—Idan Tendler, CEO of Fortscale
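A minimal sketch of "learning from the data what matters": a z-score flags the unusual account so the analyst looks there first. The dataset, threshold, and user names are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical login counts per user per day.
logins = {"alice": 4, "bob": 5, "carol": 6, "mallory": 42, "dave": 5}

values = list(logins.values())
mu, sigma = mean(values), stdev(values)

# Flag users whose activity sits well above the norm (threshold is
# an illustrative choice, not a recommendation).
flagged = [u for u, n in logins.items() if (n - mu) / sigma > 1.5]
print(flagged)  # ['mallory']
```

Real systems use far richer models, but the shape is the same: the algorithm surfaces the hidden pattern; the analyst only reviews what is flagged.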
Automation is Key 
There is a big need for automation in Big Data.
The security industry has proven the value of
Big Data, but it has just as quickly proven that
Big Data is valueless without automation
wrapped around it to make it practical. Only
once you make Big Data practical can you begin
to perform the analytics where the value of Big
Data in the security industry really gets
unlocked.
—Sean Brady, VP of Product Management at Vorstack
Identify Easy Wins 
Segment the data based on demographic 
and/or firmographic information. This is an easy 
and inexpensive way to highlight trends in the 
primary customers and industries served. This 
information is very helpful when determining 
what new products and/or services should be 
offered. In addition, look for trends in 
behavioral transaction information and further 
optimize the customer’s experience with 
relevant marketing and messaging. 
—David Handmaker, CEO of Next Day Flyers
Think Broad 
Identify all of the data you have access to and/or will 
produce, and explore possible audiences and use 
cases for it. Often times, big data plays are geared 
toward a fairly narrow audience and set of use 
cases based on the original inspiration for the 
solution. Or, there is not an active and explicit 
exploration of the full potential of what you have to 
offer. I can all but assure you that there are major 
opportunities for your offering that you haven’t 
even considered yet. The earlier you have a crisp 
view of the potential of your big data and offering, 
the better able you will be to build the right thing, in 
the right way, to exploit the potential of that idea. 
—Dirk Knemeyer, founder of Involution Studios
Setup is Key
Careful and Smart Integration 
with BI tools 
Big Data tools (MapReduce, Hive, etc.) are known
for their latency problems, but on the other hand
they are excellent for processing petabytes of data
in a distributed computing environment. When
integrating with any BI/reporting tools, big data
technologies should be used in a manner that
avoids their weaknesses and leverages their
strengths.
For example, if you are building an integrated
pipeline with BI tools, aggregate as much as you
can and use the caching or cube technologies of
the BI tools to make the experience faster for the
end user. Real-time connectivity to big data
sources like Hive/HDFS makes for a poor end-user
experience in the BI space, so it should be avoided.
—Ashish Dubey, Solutions Architect at Qubole
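The aggregate-then-cache pattern above can be sketched with stdlib tools: pre-aggregate once, then serve repeated dashboard hits from a cache instead of re-scanning the source. The fact rows and field names are illustrative.

```python
from functools import lru_cache

# Raw fact rows, as they might land in the lake: (day, region, amount).
raw_events = [
    ("2024-01-01", "US", 120.0),
    ("2024-01-01", "EU", 80.0),
    ("2024-01-02", "US", 95.0),
]

@lru_cache(maxsize=None)
def daily_totals():
    """Aggregate once; later calls are cache hits, not slow
    Hive/HDFS round trips."""
    totals = {}
    for day, region, amount in raw_events:
        totals[day] = totals.get(day, 0.0) + amount
    return tuple(sorted(totals.items()))

print(daily_totals())  # computed on first call
print(daily_totals())  # served from cache
```

A BI tool's own cube/cache layer plays the role of `lru_cache` here; the design choice is the same: push the expensive scan out of the interactive path.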
Invest in Your Pipeline 
As a rule of thumb, invest 80% of your time in
your data lake and data pipeline (mining,
extracting, cleaning, transforming, loading), and
20% in the high-level data science and machine
learning effort. Data in the wild is complex,
wrong, contradictory, and hard to access and
find. Consequently, more, faster, and more
accurate data usually has a higher impact than
more complex models, and makes for a robust
system.
—Christian Prokopp, Principal Consultant at Big Data Partnership
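A toy example of the "80%" work: a cleaning pass over messy records. The field names and validation rules are illustrative, not from the original tip.

```python
# Messy input, as data in the wild tends to arrive.
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": "n/a"},  # unparseable value: dropped
    {"name": "", "age": "28"},      # missing name: dropped
]

def clean(record):
    """Trim, type-convert, and reject records that fail validation."""
    name = record["name"].strip()
    try:
        age = int(record["age"])
    except ValueError:
        return None
    return {"name": name, "age": age} if name else None

cleaned = [r for r in map(clean, raw) if r is not None]
print(cleaned)  # [{'name': 'Alice', 'age': 34}]
```

Unglamorous, but as the tip says: correct, accessible data like this usually moves the needle more than a fancier model downstream.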
Don’t Rush Into Analysis 
Everyone with a Big Data project wants to rush 
straight into analysis. That is where things 
usually fall apart, however, because there is 
simply too much data flowing across the 
network and it is mostly in a format that 
current analytics software cannot handle. 
—Rick Aguirre, president of Cirries Technologies
Start with Heavy Lifting
Big Data success requires three steps of heavy lifting first,
before you ever analyze it.
Step 1 is data capture.
Most of the Big Data torrent is a big nothing and not relevant.
Decide what data you want to analyze and set up algorithms to
locate and corral it.
Step 2 is data control.
You want to capture the data you need as it comes
across the network. It may not be relevant in just a few minutes,
or you may need to store it for a number of years if, as one
example, it is data that might be needed later for law
enforcement purposes.
Step 3 is data humanization.
This is where you convert whatever format the data is in to a
format that your analytics software can use. Only now, at this
step, do you have the right data in the right format that you can
then use for whatever kind of analytics you have in mind.
—Rick Aguirre, president of Cirries Technologies
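The three steps above can be sketched on a synthetic stream. All record shapes, retention values, and rules here are hypothetical stand-ins.

```python
import json

# Synthetic network records arriving as a stream.
stream = [
    {"src": "10.0.0.1", "kind": "heartbeat", "ts": 100},
    {"src": "10.0.0.2", "kind": "login_fail", "ts": 101},
    {"src": "10.0.0.3", "kind": "login_fail", "ts": 102},
]

# Step 1, capture: keep only the records worth analyzing.
captured = [r for r in stream if r["kind"] != "heartbeat"]

# Step 2, control: attach a retention policy as data is captured
# (e.g. security-relevant events kept longer for investigations).
RETENTION_DAYS = {"login_fail": 365}
for r in captured:
    r["retain_days"] = RETENTION_DAYS.get(r["kind"], 7)

# Step 3, humanization: convert to the format the analytics
# software expects (one JSON line per event, here).
humanized = [json.dumps(r, sort_keys=True) for r in captured]
print(humanized[0])
```

Only after all three steps does analysis begin, which is the author's point: the heavy lifting comes first.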
Think wide
Once data is collected, you have easy access
for advanced analytics. Don’t stop at analyzing
only one log source or one dimension of data;
analyze across log sources and multiple
entities. For example, to discover advanced
cyber attacks that leverage users’ credentials,
we profile users’ behavioral activity, including
their permissions configuration, their access to
files and systems, and their web activity. We
analyze their historical activity as well as
comparing them against their peers.
—Idan Tendler, CEO of Fortscale
Use the ODBC Driver 
Perform BI Analytics and Visualization 
with the ODBC Driver. 
—Minesh Patel, Qubole
Use a Subsample 
I always start by looking at a subsample of the 
data. You often get a very good impression of 
what the main focus of the data munging or 
cleaning will be just by looking at some 
numbers (or characters). 
—Benedikt Koehler, Data Scientist and Blogger at Beautiful Data
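The subsample-first habit is one line of stdlib code. The dataset and sample size below are illustrative.

```python
import random

random.seed(7)  # reproducible peek

# Stand-in for a large dataset you don't want to scan yet.
data = [f"row-{i}" for i in range(1_000_000)]

# Pull a handful of rows and just look at them: what's dirty, what
# needs parsing, where the munging effort will go.
sample = random.sample(data, 5)
for row in sample:
    print(row)
```

With real data, this quick look usually reveals the encoding quirks, missing values, and odd formats that will dominate the cleaning work.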
Evaluate and Adapt
Measure Everything 
Measure and record everything, and keep an 
eye on your key metrics. Things change, and 
tests become obsolete, and sometimes in 
surprising ways especially when you depend 
on external data. For example, data sources 
you mine may introduce rolling changes, which 
are hard to catch as an error but easy to 
identify in metrics. 
—Christian Prokopp, Principal Consultant at Big Data Partnership
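A minimal sketch of catching a "rolling change" in a metric rather than as an error: compare today's value against a recent baseline. The metric, window, and 10% threshold are illustrative choices.

```python
from statistics import mean

# A key metric recorded per run, e.g. rows ingested per day.
history = [1000, 1010, 990, 1005, 998]
today = 750  # an upstream source quietly changed

baseline = mean(history[-5:])
drifted = abs(today - baseline) / baseline > 0.10  # >10% deviation
print("investigate the source!" if drifted else "ok")
```

The silent upstream change never raises an exception, but the recorded metric makes it obvious, which is exactly the tip's point.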
Sharing is Caring 
Measure and record everything, and keep an 
eye on your key metrics. Things change, and 
tests become obsolete, and sometimes in 
surprising ways especially when you depend 
on external data. For example, data sources 
you mine may introduce rolling changes, which 
are hard to catch as an error but easy to 
identify in metrics. 
—Idan Tendler, CEO of Fortscale
Encryption
Encrypting data at rest is a best practice.
—Minesh Patel, Qubole
Pick the Right Distribution 
A common question is whether to go with a
distribution from Apache or from a vendor. If
there is enough expertise in the organization to
know the internals of the different frameworks,
integrate them, and resolve any issues quickly,
then go with the Apache distribution. If that
expertise is not available, use a distribution
from a vendor and get commercial support to
resolve any issues that arise.
—Praveen Sripati, Hadoop trainer and author of Dattamsha
Start Small
Developing a Big Data strategy is all about
starting small and making gradual steps toward
becoming more data-driven. Start by breaking
down the data silos within your organization to
gain the most insight from your data when you
start analyzing it through a variety of tools.
—Mark van Rijmenam, CEO / Founder of BigData-Startups
Have a Business Intent 
There is often a perception that there is gold in 
an organization’s data, and that if you just look 
hard enough, you will find it. In reality, this 
perception can lead to fruitless efforts with no 
real direction and no payoff. Instead, start with 
a business intent in mind. What are the actions 
you would take—and the value to your 
business—if data can provide the answer to a 
certain question? 
—Sean Stauth, Director, Client Services, Silicon Valley Data Science
Update Your Strategy 
Your data strategy should be a living document 
that helps you get the most value from your 
data. As your goals, your technical environment, 
or the market change, keep it updated to help 
you follow those changes and stay on course. 
—Scott Kurth, VP, Advisory Services, Silicon Valley Data Science
A Data Science Mindset
Data Science Mindset 
Have an always-on data science mindset — 
Successful big data initiatives start with a holistic 
360-degree view of the problem space. This includes
understanding the inputs (data types, sources, 
features), the desired outputs (decisions, goals, 
predictions), and the constraints (model 
parameters, boundary conditions, optimization 
constraints). To achieve this perspective, one must 
be thinking like a scientist from start to finish: 
collect data, infer a testable hypothesis, design an 
experiment, test and evaluate the results, refine 
your hypothesis, and repeat (if necessary). 
—Kirk Borne, Data Scientist, Astrophysicist and Big Data Science Consultant
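One turn of the collect-hypothesize-test-refine loop described above, on a toy dataset. The observations, hypothesis, and refinement rule are all illustrative.

```python
# Collected data: sensor readings, one of which looks anomalous.
observations = [2.1, 1.9, 2.0, 2.2, 5.0, 2.1]

def evaluate(threshold):
    """Hypothesis under test: readings above `threshold` are anomalies."""
    return [x for x in observations if x > threshold]

# Design an experiment, test, and refine if the hypothesis finds nothing.
hypothesis = 4.0
anomalies = evaluate(hypothesis)
if not anomalies:          # refine the hypothesis and repeat
    hypothesis -= 0.5
    anomalies = evaluate(hypothesis)
print(anomalies)  # [5.0]
```

Real initiatives iterate this loop many times over models rather than a single threshold, but the scientist's cycle is the same.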
Return on Innovation 
The most important ROI in Big Data Analytics 
projects is Return On Innovation. What are you 
doing that’s different and consequential? What 
sets you apart from the rest of the multitudes in 
this space? 
—Kirk Borne, Data Scientist, Astrophysicist and Big Data Science Consultant
Focus on the Users 
Developing a big data platform requires focusing 
on the users. Serve a few users well, and let their 
processing scale up with your capabilities. 
“Premature platformization” or trying to satisfy too 
many use cases too early in the project leads to 
failures. Make the initial users successful, and the 
ecosystem will thrive and grow. 
—Owen O’Malley – Sr. Architect and Co-founder of Hortonworks
Use the API
Using the API: samples are available for the
Java SDK, Python SDK, and REST.
—Minesh Patel, Qubole
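A sketch of driving such a service over REST with only the stdlib. The endpoint, payload, and auth header below are hypothetical placeholders, not Qubole's actual API; the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical REST call: submit a query command to an API endpoint.
req = urllib.request.Request(
    "https://api.example.com/v1/commands",   # placeholder URL
    data=json.dumps({"query": "SELECT 1"}).encode(),
    headers={
        "Content-Type": "application/json",
        "X-AUTH-TOKEN": "<your-token>",       # placeholder credential
    },
    method="POST",
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted here.
```

The SDKs wrap exactly this kind of call; REST is the lowest common denominator when no SDK fits your stack.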
Take Real-Time Action 
If you cannot take real-time action, you have 
no need of real-time processing. There will 
always be batch processing workloads 
supporting the enterprise, and increasingly 
dynamic decision areas can be effectively 
supported by analytical systems because of 
advances in data architectures. 
—Sanjay Mathur, CEO, Silicon Valley Data Science
Store Denormalized State 
State—the full context of an event, like a 
customer visit or the completion of a step in 
a manufacturing process—can be expensive 
to reassemble after the fact. This is 
particularly true with highly relational 
systems: witness the complex ETL (extract, 
transform, load) workloads that enterprise 
data warehouse systems struggle to scale. 
Storing denormalized state, e.g., rich logs, for
analysis has proven highly successful for the
web businesses of Silicon Valley, and those
techniques can be applied to industries
across the economy.
—John Akred, CTO, Silicon Valley Data Science
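A "rich log" line of the kind the tip describes: the full context of one event, captured denormalized at write time so no multi-table join is needed later. Every field name and value here is illustrative.

```python
import json

# One customer visit, with user, page, and session context inlined
# rather than referenced by foreign keys.
event = {
    "event": "page_view",
    "ts": "2024-05-01T12:00:00Z",
    "user": {"id": 42, "plan": "pro", "signup_date": "2023-11-02"},
    "page": {"url": "/pricing", "experiment": "variant-b"},
    "session": {"id": "abc123", "referrer": "search"},
}

line = json.dumps(event, sort_keys=True)
print(line)  # one self-contained line, ready for later analysis
```

The trade-off is more storage per event in exchange for skipping the expensive ETL reassembly the tip warns about.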
Build a Common Platform 
Whether you are thinking about migrating
toward Big Data or just starting out with data
altogether, it helps to focus on building and
maintaining a common platform. Like software
development platforms, data platforms should
include source control, change management,
and testing scenarios. This will help reduce
future migration costs and will lead to
long-term sustainable, competitive data
capabilities.
—Ryan Kirk, Sr. Data Scientist at Hipcricket
Looking for additional big data tips and advice? 
Subscribe to Qubole's email newsletter. 
Sources: 
http://www.qubole.com/new-series-big-data-tips/ 
http://www.qubole.com/setup-is-key/ 
http://www.qubole.com/evaluate-and-adapt/ 
http://www.qubole.com/data-mindset/
