Is the Elephant in the room? Regunath B firstname.lastname@example.org Twitter : @RegunathB
Quick read 1.8 million words?
The story is about a battle between great kings and sons, with the principal characters being Arjuna, Pandu, Bhishma, Bharata, Karna, Duryodhana, Yudhishthira etc.
Source : The Gramener blog for visualizations – Analysis of the entire text contained in the Mahabharatha (http://blog.gramener.com/category/visualisations)
Insights from Social Media Source : ttwick Billionaires page (Bill Gates Twitter Social Media profile) (http://ttwick.com/blog/bill-gates-twitter-social-media/)
Insights from Social Media Source : Impact page of Satyamevjayate (http://www.satyamevjayate.in/impact/impact.php/)
What is Big Data?
● Big Data challenges and opportunities arise when information in an enterprise demonstrates the following characteristics:
  – Volume
    ● Transaction data from enterprise systems
      – For example : Financial transactions, Orders
  – Variety
    ● Structured and Unstructured data
      – For example : Customer contact, Social Media, Biometrics
  – Velocity
    ● High information arrival rates
      – For example : Application events, Tagging, Rating of content
● Big Data opportunities arise when the enterprise is able to derive Value from the data characteristics defined above
Food for thought.... on theorems and laws
● Do hardware and technology trends affect your technology selection?
  – CPU, RAM and disk size double every 18-24 months [Moore’s law]
  – Disk seek time remains nearly constant, improving by only around 5% per year
● Data Seek vs. Data transfer
  – Software that leverages one of the above (or) a combination: B+ tree index, LSM tree index, “Fractal tree”
● CAP theorem effect
  – Ability to achieve only 2 of 3 properties of shared-data systems : data Consistency, system Availability and tolerance to network Partitions
● Bandwidth is the most scarce commodity in a Data Center
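The seek vs. transfer point above can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the seek time, transfer rate, dataset size and record size are assumed values, not figures from the talk. It compares updating records in place (one seek per record, the access pattern a B+ tree index favours) with streaming the whole dataset once (the sequential pattern an LSM tree leans on).

```python
# Back-of-envelope comparison of seek-bound vs. transfer-bound updates.
# All hardware figures below are assumed, illustrative values.

SEEK_TIME_S = 0.010          # assumed average disk seek: 10 ms
TRANSFER_RATE_BPS = 100e6    # assumed sequential transfer rate: 100 MB/s
DATASET_BYTES = 1e12         # assumed dataset size: 1 TB
RECORD_BYTES = 100           # assumed record size: 100 bytes


def random_update_seconds(fraction: float) -> float:
    """Time to update a fraction of records in place, one seek per record."""
    records = DATASET_BYTES / RECORD_BYTES
    return records * fraction * SEEK_TIME_S


def full_rewrite_seconds() -> float:
    """Time to stream the whole dataset once: read it all and write it all."""
    return 2 * DATASET_BYTES / TRANSFER_RATE_BPS


if __name__ == "__main__":
    for fraction in (0.0001, 0.01, 0.1):
        seek_h = random_update_seconds(fraction) / 3600
        scan_h = full_rewrite_seconds() / 3600
        print(f"update {fraction:.2%}: seeks {seek_h:8.1f} h, rewrite {scan_h:4.1f} h")
```

Under these assumptions, touching a tiny fraction of records is cheaper with seeks, but once the updated fraction grows, rewriting the whole dataset sequentially wins by a wide margin. Because seek time improves so slowly, that crossover point keeps moving in favour of sequential transfer.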
Aadhaar Patterns & Technologies
• Principles
  • POJO based application implementation
  • Light-weight, custom application container
  • Http gateway for APIs
• Compute Patterns
  • Data Locality
  • Distribute compute (within an OS process and across)
• Compute Architectures
  • SEDA – Staged Event Driven Architecture
  • Master-Worker(s) Compute Grid
• Data Access types
  • High throughput streaming : bio-dedupe, analytics
  • High volume, moderate latency : workflow, UID records
  • High volume, low latency : auth, demo-dedupe, search – eAadhaar, KYC
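To illustrate the SEDA idea named above, here is a minimal sketch in Python: each stage owns a bounded input queue and a small pool of worker threads, and hands its result to the next stage. This is not the Aadhaar implementation (which the slides describe as JVM-based); the `Stage` class and the stage names `validate`, `enrich` and `collect` are hypothetical, chosen only for illustration.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage's workers to shut down


class Stage:
    """One SEDA stage: an input queue drained by a small pool of worker threads."""

    def __init__(self, name, handler, workers=2, downstream=None):
        self.name = name
        self.handler = handler
        self.downstream = downstream
        self.inbox = queue.Queue(maxsize=100)  # bounded queue gives back-pressure
        self.threads = [threading.Thread(target=self._run) for _ in range(workers)]
        for t in self.threads:
            t.start()

    def submit(self, event):
        self.inbox.put(event)  # blocks when the stage is saturated

    def _run(self):
        while True:
            event = self.inbox.get()
            if event is _STOP:
                break
            result = self.handler(event)
            if self.downstream is not None:
                self.downstream.submit(result)

    def shutdown(self):
        # One sentinel per worker, queued behind any pending events.
        for _ in self.threads:
            self.inbox.put(_STOP)
        for t in self.threads:
            t.join()


def run_pipeline(events):
    """Push events through hypothetical validate -> enrich -> collect stages."""
    results = []
    lock = threading.Lock()

    def collect(event):
        with lock:
            results.append(event)
        return event

    sink = Stage("collect", collect, workers=1)
    enrich = Stage("enrich", lambda e: {**e, "enriched": True}, downstream=sink)
    validate = Stage("validate", lambda e: {**e, "valid": "id" in e}, downstream=enrich)
    for e in events:
        validate.submit(e)
    # Shut down in pipeline order so each stage drains before the next stops.
    validate.shutdown()
    enrich.shutdown()
    sink.shutdown()
    return results
```

Because every stage has its own queue and thread pool, a slow stage can be given more workers without touching the others, and the bounded queues push back on upstream stages instead of letting work pile up without limit.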
Aadhaar Architecture
• Real-time monitoring using Events
• Work distribution using SEDA & Messaging
• Ability to scale within JVM and across
• Recovery through check-pointing
• Sync Http based Auth gateway
• Protocol Buffers & XML payloads
• Sharded clusters
• Near Real-time data delivery to warehouse
• Nightly data-sets used to build dashboards, data marts and reports
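Recovery through check-pointing, listed above, can be sketched roughly as follows. This is a generic illustration and not the Aadhaar mechanism: the function names, the JSON checkpoint file and the `every` interval are all assumptions made for the example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically: write a temp file, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def process_all(items, handle, checkpoint_path, every=10):
    """Process items in order, recording progress every `every` items."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        handle(items[i])
        if (i + 1) % every == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(items))
```

After a crash, calling `process_all` again resumes from the last saved index. Items handled after that checkpoint but before the crash are processed a second time, so `handle` needs to tolerate repeats.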
Putting data to work at Aadhaar
Big Data at Flipkart
● Website traffic
  – Millions of page hits per day – product catalogs, item availability, promotions, search
  – Millions of active sessions and shopping carts
  – Latencies measured in low-digit milliseconds
● Growing list of categories (Books, Mobiles, Toys, Personal, Home, Baby, Digital music...)
  – Electronic inventory – MP3, eBooks, movies
● New business models, newer channels
● Understanding users, user profiles, social media, experience
  – Terabytes of logs containing browsing behavior, data from multiple engagement channels
  – Recommendations based on millions of possible item matches and relevance algorithms
Is the Elephant in the room?
From Wikipedia: "Elephant in the room" is an English metaphorical idiom for an obvious truth that is being ignored or goes unaddressed.
Big Data opportunities and challenges are real and present: Big Data is the Elephant in the room.
Some takeaways from experience
● Make everything API based
● Everything fails (hardware, software, network, storage)
  – System must recover, retry transactions, and in effect self-heal
● Security and privacy should not be an afterthought
● Scalability does not come from one product
  – Watch out for solution and technology stereotyping
● Open scale out is the only way to go
  – Heterogeneous, multi-vendor, commodity compute, growing in linear fashion. Nothing else can adapt!
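The "everything fails, so retry" takeaway above might look like this in code. It is a minimal sketch, assuming the failure is transient and the operation is safe to repeat; `call_with_retry` and its parameters are hypothetical names, and exponential backoff with jitter is one common retry policy, not one the talk prescribes.

```python
import random
import time


def call_with_retry(operation, max_attempts=5, base_delay_s=0.1, sleep=time.sleep):
    """Run `operation`, retrying on failure with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Delay doubles each attempt; jitter spreads out simultaneous retries.
            delay = base_delay_s * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

A caller would wrap a network or storage call, for example `call_with_retry(lambda: send(order))` where `send` is whatever operation may fail. The `sleep` parameter exists only so the waiting can be replaced in tests.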