This document discusses big data consulting and Hadoop. It gives an overview of how the information age and the rise of connectivity have created new opportunities for using big data and Hadoop. Specifically, it covers how Hadoop can be used for storage, search, messaging, targeting, business intelligence, data warehousing, machine learning, and device management. It also addresses questions about the maturity of Hadoop and outlines its security features.
3. The information age
■ The “economic third wave” has badly hit many blue chip
organisations
■ Manufacturing and retail are in rapid decline in Europe and the US
■ Tech, connectivity and information are restructuring our societies
■ Levels of political and social engagement have surged
■ Trading platforms are empowering small businesses
5. Innovation
■ Mass-production hates innovation
■ Innovation means change – a huge cost with little benefit for
production-line economies
■ Continuous improvement mentality
■ Knowledge services need to innovate to differentiate
■ Change in a virtual world can be cheap and yield huge rewards
■ Continuous reinvention mentality
7. Big data vis-à-vis innovation
■ In a free market like the web, innovation can open up new
opportunities
■ Consumer access to grid computing tech is a recent innovation
■ Grid computing opens up new opportunities that would otherwise
not be viable
■ Ideal for ventures architected around the long-tail economic
model
8. The future - thingternet
■ The internet of things is with us
■ Billions of connected devices, even digital tattoos
9. Big data vis-à-vis internet of things
■ Billions of connected devices create a huge amount
of data
■ Before big data technology, the Internet of Things was
nearly impossible to monetize
10. The internet of things is a wild west
■ Many new, unsolved challenges
■ Privacy
■ Governance
■ Civil liberties
■ New challenges = new opportunities
12. Hadoop refresher
■ FOSS solution for processing terabytes to petabytes of data
■ Using arrays of regular servers
■ Hadoop core:
■ HDFS - a scale-out file system
■ YARN - a scale-out application resource manager
■ Runtimes:
■ Spark, Impala, Flink, MapReduce, Kafka, SolrCloud etc.
■ Components for data protection, access control and operational management
■ NoSQL databases
■ HBase, Accumulo, Cassandra etc.
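To make the MapReduce runtime named on the Hadoop refresher slide more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and on a real cluster each phase would be spread across many servers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data on Hadoop", "Hadoop stores big data"]
    print(word_count(sample))
```

Because the map and reduce functions only see one line or one key at a time, the same logic can in principle be partitioned across an array of regular servers, which is the property the slide relies on.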
14. Storage
■ Pure online data storage, with no other processing
■ Low cost per-GB for petascale online storage
■ Option to directly query and analyse the data is
available if required.
15. Search
■ Example: huge, constantly changing catalogue of
products – like eBay and Amazon
■ SolrCloud – an advanced search engine serving
terabytes of content from Hadoop
16. Messaging
■ A distributed message queue backed by a Hadoop
cluster - Apache Kafka
■ Elastically scalable
■ Messages are persisted and replicated for durability
■ TBs of messages per broker with predictable
performance
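The ideas of persisted, replicated messages on the Messaging slide can be sketched with a toy in-memory model. This is not the Kafka client API: the `ToyTopic` class, its methods, and the use of CRC32 for key hashing are all assumptions made for illustration, and "persistence" here simply means messages are kept in an append-only list rather than deleted on read.

```python
import zlib
from typing import List


class ToyTopic:
    """In-memory stand-in for a partitioned, replicated message log."""

    def __init__(self, partitions: int = 3, replication: int = 2) -> None:
        # replicas[p][r] is replica r of partition p; each is an append-only list.
        self.replicas: List[List[List[bytes]]] = [
            [[] for _ in range(replication)] for _ in range(partitions)
        ]

    def partition_for(self, key: bytes) -> int:
        # Stable hash, so the same key always lands in the same partition.
        return zlib.crc32(key) % len(self.replicas)

    def send(self, key: bytes, value: bytes) -> int:
        """Append to every replica of the key's partition; return the offset."""
        p = self.partition_for(key)
        for log in self.replicas[p]:
            log.append(value)
        return len(self.replicas[p][0]) - 1

    def read(self, partition: int, offset: int, replica: int = 0) -> List[bytes]:
        """Return all messages from `offset` onward; reading removes nothing."""
        return self.replicas[partition][replica][offset:]


if __name__ == "__main__":
    topic = ToyTopic()
    p = topic.partition_for(b"device-1")
    topic.send(b"device-1", b"temperature=21")
    topic.send(b"device-1", b"temperature=22")
    print(topic.read(p, 0))
```

Consumers track their own offsets and can re-read from any replica, which is why adding partitions or consumers scales the system without changing what has already been written.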
17. Targeting
■ Personalised content for users
■ Generates and consumes a huge amount of log data
■ for reporting
■ for predictive analysis
■ Predictive analysis is compute intensive
■ Can be TBs of data per day
18. Self-service Business Intelligence
■ Enterprise Data Hub paradigm
■ A very popular emerging use case
■ Business users directly access raw datasets
using specialised discovery tools built on top of
Hadoop - Datameer, Platfora and others
19. Data warehousing
■ Migration of Enterprise Data Warehouse to Hadoop
■ Big cost savings versus traditional vendors like Oracle and
Teradata
20. Machine learning
■ Predictive analytics with Spark MLlib or
Revolution R Enterprise
■ Automatically predict component failures for
proactive intervention
21. Big Database
■ Low latency, high throughput, high concurrency,
high volume
■ Algorithmic trading
■ Real-time ad auctions
■ Volumes of 200 billion transactions per day reliably
served in real time
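To put the daily volume on the Big Database slide in per-second terms, the short calculation below converts it to an average rate. It assumes "200BN" means 200 billion and that load is spread evenly over the day; real peak rates would be higher than this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_per_second(per_day: float) -> float:
    """Convert a daily transaction count into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = average_per_second(200e9)
    # Prints: 2,314,815 transactions per second on average
    print(f"{rate:,.0f} transactions per second on average")
```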
22. Device management
■ Analysis and response to threats detected by SPI
module on remote switch
■ Automated systems management – shut down
heating when nobody is home to reduce heating bill and
emissions
■ Monitor driver propensity to break the speed limit -
offer lower insurance premiums to good drivers
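The driver-monitoring example on the Device management slide could be approximated as follows. This is a sketch under stated assumptions: the `SpeedSample` record, the definition of propensity as the fraction of samples over the limit, and the 5% threshold and 10% discount are all hypothetical values chosen for illustration, not figures from the slides.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SpeedSample:
    speed_kmh: float  # speed reported by the vehicle
    limit_kmh: float  # speed limit at the sample location


def speeding_propensity(samples: Iterable[SpeedSample]) -> float:
    """Fraction of samples in which the driver exceeded the limit."""
    samples = list(samples)
    if not samples:
        return 0.0
    over = sum(1 for s in samples if s.speed_kmh > s.limit_kmh)
    return over / len(samples)


def premium_discount(propensity: float,
                     threshold: float = 0.05,   # hypothetical cut-off
                     discount: float = 0.10) -> float:  # hypothetical discount
    """Return the discount for drivers at or below the threshold, else 0."""
    return discount if propensity <= threshold else 0.0


if __name__ == "__main__":
    trip = [SpeedSample(48, 50), SpeedSample(55, 50),
            SpeedSample(50, 50), SpeedSample(30, 30)]
    p = speeding_propensity(trip)
    print(p, premium_discount(p))
```

With billions of devices reporting, the per-driver aggregation would run as a distributed job over the collected samples; the per-sample logic stays the same.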
30. Secure and available
■ RPC authentication and encryption with PKI
■ Data encryption at rest and in transit
■ Kerberos resource access control - HDFS, YARN
■ Table cell level permissions - Accumulo
■ Online snapshot backups
■ No single point of failure (SPoF)
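The cell-level permissions mentioned for Accumulo can be illustrated with a simplified model in which every cell carries a set of required labels and a reader sees only the cells whose labels they hold. This is not the Accumulo API: the `Cell` tuple and `visible_cells` function are hypothetical, and the sketch models only "all labels required", whereas Accumulo's visibility expressions also allow alternatives.

```python
from typing import FrozenSet, Iterable, List, Tuple

# (row, column, required labels, value); the layout is an assumption.
Cell = Tuple[str, str, FrozenSet[str], str]


def visible_cells(cells: Iterable[Cell],
                  authorizations: FrozenSet[str]) -> List[Tuple[str, str, str]]:
    """Return only the cells whose required labels are all held by the reader."""
    return [
        (row, column, value)
        for row, column, labels, value in cells
        if labels <= authorizations  # subset test: reader holds every label
    ]


if __name__ == "__main__":
    table: List[Cell] = [
        ("patient-1", "name", frozenset({"staff"}), "A. Smith"),
        ("patient-1", "diagnosis", frozenset({"staff", "medical"}), "flu"),
        ("patient-1", "ward", frozenset(), "B2"),
    ]
    print(visible_cells(table, frozenset({"staff"})))
```

Two readers scanning the same row therefore get different results without the application having to filter anything itself.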