In this presentation, Yekasa Kosuru discusses the challenges Nokia has faced with Big Data and the solutions it uses across its different platforms. The solutions are broken into phases, which Kosuru walks through in detail using statistics and flow charts.
2. Agenda
• Phases of Big Data Challenges @Nokia
– Who we are
– Big data platform
– Use case data flows
– High level architecture
– Challenges
• Phases of challenges
5. 5
Big Data Flows and Differentiates
[Diagram: We Collect User Data… …on All Supported Platforms… …to Be Made Available for Analysis]
[Diagram labels: Nokia Account; Big Data Analytics]
Enabling feedback loops for continuous improvement, Location Optimized Experience, CRM, etc.
6. Phase 0
2008 – ’10
Build Technology Platform, Get Data
7. 7
Business Challenges
• Data silos, no unique identifiers, missing semantics
• Multiple sources - overlapping, conflicting
• Timely processing of large volumes & velocity of data
• Partial, insufficient, inaccurate, inconsistent data
• Data/wire formats, security, privacy and other policies unknown
Central Big Data Platform created
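The combination of overlapping, conflicting sources and missing unique identifiers can be illustrated with a small reconciliation step. This is only a minimal sketch under stated assumptions: the record fields, the name-normalization key, and the "newest non-empty value wins" rule are all hypothetical and are not taken from the talk.

```python
def reconcile(records):
    """Merge records that share a normalized key; newest non-empty value wins."""
    merged = {}
    # Process oldest first so that later (newer) records overwrite earlier ones.
    for rec in sorted(records, key=lambda r: r["updated"]):
        # Assumption: a normalized name stands in for the missing unique ID.
        key = rec["name"].strip().lower()
        target = merged.setdefault(key, {})
        for field, value in rec.items():
            if value not in (None, ""):  # partial records keep older values
                target[field] = value
    return merged


# Hypothetical records for one point of interest seen by three sources.
records = [
    {"source": "maps", "name": "Cafe Luna ", "phone": "555-0100", "updated": 1},
    {"source": "checkins", "name": "cafe luna", "phone": "", "updated": 2},
    {"source": "partner", "name": "CAFE LUNA", "phone": "555-0199", "updated": 3},
]
poi = reconcile(records)["cafe luna"]
```

A real platform would need far stronger identity resolution than name matching; the sketch only shows why silos without shared identifiers force this kind of step.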
8. 8
Using different big data sets…
…to verify Map accuracy and create Motion Graph
9. Big Data Analytics Platform Data Flows
[Diagram; labels recovered from the slide:]
• Web Applications, Device Applications, Probes, 3rd Party
• Device, User Profile, POI, Map, Activity, Sensor
• Activity Logs, Reference Data, VShards (NoSQL)
• Data Intake, HDFS
• ETL, data crunching, attribution, ML Algorithms, Aggregation
• Analytics Cluster, Analytical DBMS, Data Asset Catalog
• Reports, Dashboards, Data Discovery, Interactive Queries, Batch Queries
13. 13
2012 Production Statistics
• Tens of PB of data across Nokia
• Multi-tenant, multi-petabyte analytics cluster
• 10-20K+ jobs per day
• 600+ internal users
• 300M+ KV queries
• Terabytes flowing in every day
• Multiple data centers around the world
14. 14
Challenges With Big Data
• Complex ecosystem of technologies - many moving parts, slower deploy cycles, data integration is complex
• Capacity & Scale Issues – Provision for peaks or sustained, storage or compute?
• DBMS great for performance & data management, but can’t scale - price/performance & ACIDity
• Hadoop great for ETL, but poor on query performance & data management, not interactive
• Data and Metadata fragmentation
15. 15
Big Data Capacity Issues
• Spiky Workloads
• Capacity Provisioning
– Peaks
– Sustained loads
• How many clusters?
– SLA/Adhoc/Research
– Multiple data centers
– Data duplication
• Tenancy – single/multi
• TCO
– Hadoop can get expensive - storage & compute tightly coupled, idle machines
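The peaks-versus-sustained question can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a hypothetical hourly demand profile measured in machines; the numbers are illustrative and are not Nokia figures.

```python
def utilization(demand, provisioned):
    """Fraction of provisioned machine-hours that are actually used."""
    used = sum(min(d, provisioned) for d in demand)
    return used / (provisioned * len(demand))


def unmet_demand(demand, provisioned):
    """Machine-hours of work that exceed the provisioned capacity."""
    return sum(max(d - provisioned, 0) for d in demand)


# Hypothetical spiky workload: machines needed in each of 8 hours.
demand = [10, 10, 10, 80, 100, 10, 10, 10]

peak = max(demand)                             # provision for peaks
sustained = sorted(demand)[len(demand) // 2]   # provision for the typical load

peak_util = utilization(demand, peak)          # mostly idle machines
sustained_unmet = unmet_demand(demand, sustained)  # work that has to wait
```

With this profile, sizing for the peak leaves the cluster about 30% utilized, while sizing for the typical load leaves 160 machine-hours of spike work unserved, which is the tension the slide describes.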
16. 16
Cloud helps with some issues
• Operational & IT complexity reduced – API based spin up & tear down – rapid deployments, faster cycles
• Pay for what is used
• Capacity issues mitigated - idle machines or peaks not an issue – elastically scale up and down
• De-coupled Storage and Compute makes sense
• Stateless architecture, recycle slow/bad machines, no need for rolling upgrades, instead do rolling replace
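The "rolling replace" idea can be sketched as follows. This is a minimal illustration, not the mechanism used at Nokia: the `launch` callback stands in for whatever cloud API spins up a machine, and the node names and version prefixes are hypothetical.

```python
from itertools import count


def rolling_replace(fleet, needs_replacing, launch, batch_size=2):
    """Return a new fleet where flagged nodes are swapped for fresh ones.

    Replacements are launched before the old nodes are dropped, so the
    fleet never shrinks below its original size.
    """
    fleet = list(fleet)
    stale = [node for node in fleet if needs_replacing(node)]
    for start in range(0, len(stale), batch_size):
        batch = stale[start:start + batch_size]
        fresh = [launch() for _ in batch]  # spin up first (an API call in practice)
        fleet.extend(fresh)
        for node in batch:                 # then tear down the old nodes
            fleet.remove(node)
    return fleet


# Hypothetical fleet: the image version is encoded in each node name.
ids = count(1)
new_fleet = rolling_replace(
    fleet=["v1-a", "v1-b", "v2-c", "v1-d"],
    needs_replacing=lambda node: node.startswith("v1"),
    launch=lambda: f"v2-new{next(ids)}",
)
```

This only works because the nodes are stateless: nothing has to be migrated off an old machine before it is discarded, so slow or outdated machines can simply be recycled.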
18. 18
Still Pending
• Data and Metadata fragmentation, need deeper integration into all tools/frameworks
• Advanced Analytics - Data science problems are hard & inefficient to implement in Map Reduce/RDBMS
19. 19
Complex Analytics
• Mathematicians think in terms of arrays, not Map Reduce
• Data science tools can’t efficiently handle big data
• Data partitioning is naïve, indexing won’t scale
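The gap between array thinking and Map Reduce can be seen in a matrix-vector product. The sketch below is a plain-Python illustration with a made-up 2x2 matrix; `map_phase` and `reduce_phase` are hypothetical helper names that mimic the map and reduce steps and are not calls into any Hadoop API.

```python
from collections import defaultdict

A = [[1, 2],
     [3, 4]]
x = [10, 20]

# Array view: one line of linear algebra, y = A x.
y_array = [sum(a * b for a, b in zip(row, x)) for row in A]


# Map Reduce view: the same product recast as key-value records.
def map_phase(entries, vector):
    """Emit (row index, partial product) for every entry (i, j, value)."""
    for i, j, value in entries:
        yield i, value * vector[j]


def reduce_phase(pairs):
    """Sum the partial products for each row key."""
    totals = defaultdict(int)
    for key, partial in pairs:
        totals[key] += partial
    return [totals[i] for i in sorted(totals)]


# The matrix has to be flattened into (row, column, value) records first.
entries = [(i, j, v) for i, row in enumerate(A) for j, v in enumerate(row)]
y_mapreduce = reduce_phase(map_phase(entries, x))
```

Both versions compute the same vector, but the second forces the analyst to restate a one-line array expression as record flattening, keyed emission, and aggregation, which is the mismatch the slide points at.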