Andrii Gakhov, author of the book Probabilistic Data Structures and Algorithms for Big Data Applications, talks about probabilistic data structures and their application to the big data domain. Host Robert Blumen spoke with Dr. Gakhov about how probabilistic data structures differ from their exact counterparts; hash functions, both cryptographic and non-cryptographic; space versus accuracy tradeoffs; space versus processing time tradeoffs; and the main problem domains: membership testing, cardinality, frequency, similarity, and rank. The discussion covers Bloom filters for membership testing (performance characteristics, use cases, design patterns for lookup problems, and implementation); LinearCount and HyperLogLog for cardinality (use cases in web applications and implementation); Count-Min Sketch for frequency estimation; existing library support; and whether PDS should be taught in beginning courses.
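For readers who want a concrete picture of the membership-testing idea before listening, here is a minimal Bloom filter sketch in Python. It is an illustration only, not code from the episode, the book, or any of the libraries linked below. The class name `BloomFilter` and its parameters `capacity` and `error_rate` are hypothetical, and it uses SHA-256 from the standard library purely to stay self-contained; production implementations typically prefer faster non-cryptographic hashes.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; names and defaults are assumptions."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing formulas for n expected items and target
        # false-positive rate p: m = -n ln p / (ln 2)^2, k = (m / n) ln 2
        self.size = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit
        # slices of a single digest instead of k separate hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter(capacity=1000, error_rate=0.01)
bf.add("alice")
print("alice" in bf)  # an added item is always reported as present
```

A lookup for an item that was never added usually returns False, but can occasionally return True; that false-positive rate is the accuracy given up in exchange for the small, fixed memory footprint.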
Show Notes
Related Links
- Probabilistic Data Structures and Algorithms in Python library
- Probabilistic Data Structures for C# library
- Book Probabilistic Data Structures and Algorithms for Big Data Applications by Dr. Andrii Gakhov
- PROBABILISTIC DATA STRUCTURES FOR WEB ANALYTICS AND DATA MINING by Ilya Katsov
- Probabilistic data Structures – Bloom filter and HyperLogLog for Big Data
- Bloom filters in Apache Cassandra
- Andrii Gakhov slideshare Probabilistic Data Structures: All you wanted to know but were afraid to ask Part 1: Membership
- Andrii Gakhov slideshare Probabilistic Data Structures: All you wanted to know but were afraid to ask Part 2: Cardinality
- Andrii Gakhov slideshare Probabilistic Data Structures: All you wanted to know but were afraid to ask Part 3: Frequency
- Andrii Gakhov slideshare Probabilistic Data Structures: All you wanted to know but were afraid to ask Part 4: Similarity
- Guest twitter: https://twitter.com/gakhov
- Guest email: [email protected]