Week 1 – CPT Practicum

I work for a healthcare client on a project called Security Data Lake. The main objective of the project is to aggregate logs across all environments (prod/test/dev), store them in a centralized location, and analyze them for signs of potential intrusion, data breaches, and similar threats. Storing logs at this volume called for a big data system that could also be used for data processing. Hadoop met these requirements: it provides storage and high computation capacity, and it is comparatively cheaper than alternatives that could meet the same requirements.
We currently collect logs from all layers of the network (routers), firewalls, Windows machines, Unix machines, password access management tools, etc., which comes to approximately 1-2 TB of data per day. With Hadoop's default replication factor of 3, the actual disk usage is approximately 3-6 TB each day. Our system was designed to store a year's worth of data and also to provide enough compute capacity (CPU/RAM/disk) for data science and business users to process that data for log analysis and threat hunting. The systems are currently configured with 1.6 PB of storage capacity, of which ~60% is already occupied by ingested data. With growth of 3-6 TB each day, at the current rate the systems are expected to reach 70% in ~30-40 days. Disk utilization above ~75% is a red flag, because jobs processing the existing data would fail if there is not enough space for intermediate data.
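The headroom estimate above can be checked with a short calculation. This is a minimal sketch, not the team's planning tool: the function name `days_until_threshold` is made up for illustration, 1 PB is assumed to equal 1000 TB, and growth is assumed to be linear at the stated ingest rate.

```python
def days_until_threshold(capacity_tb, used_fraction, target_fraction, daily_growth_tb):
    """Days until disk utilization grows from used_fraction to target_fraction."""
    if daily_growth_tb <= 0:
        raise ValueError("daily_growth_tb must be positive")
    headroom_tb = capacity_tb * (target_fraction - used_fraction)
    # No headroom left means the threshold is already reached.
    return max(headroom_tb, 0.0) / daily_growth_tb


CAPACITY_TB = 1600.0  # 1.6 PB, assuming 1 PB = 1000 TB
REPLICATION = 3       # Hadoop default replication factor

for raw_tb_per_day in (1.0, 2.0):
    growth_tb = raw_tb_per_day * REPLICATION  # on-disk growth per day
    to_70 = days_until_threshold(CAPACITY_TB, 0.60, 0.70, growth_tb)
    to_75 = days_until_threshold(CAPACITY_TB, 0.60, 0.75, growth_tb)
    print(f"{raw_tb_per_day:.0f} TB/day raw: 70% in {to_70:.0f} days, 75% in {to_75:.0f} days")
```

Under these assumptions the 10% of headroom between 60% and 70% is about 160 TB, so the two ends of the ingest range give a window that contains the ~30-40 day estimate. The actual date depends on where the daily ingest falls within 1-2 TB.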
As it is time to re-plan the capacity of the systems, I proposed to my architects and project team that we use AWS S3 Glacier buckets for archival data. Data older than a year is mostly not used for log analysis by the data science teams, but by law it must be retained for 10 years, and keeping it on premises would only increase storage costs. For comparison, the cost of storing the current rate of input per annum in S3 Glacier, calculated with an input of 2 TB/day for 365 days, was ~$3363 (AWS_cost_glacier). This figure does not cover the full cost of moving the data, because we also need to account for deploying a Hadoop cluster in AWS. There are currently two solutions available for moving data into AWS; one is to deploy a cluster in AWS and run DistCp from the on-premises Hadoop cluster to the Hadoop cluster in AWS, with S3 buckets as the backend storage.
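The volume behind the storage estimate can be sketched as follows. This is an illustrative calculation only: the function names are made up, 1 TB is assumed to equal 1024 GB, and `PRICE_PER_GB_MONTH` is a hypothetical placeholder rather than a quoted AWS price, so the printed cost is not meant to reproduce the figure above. The archive is sized on raw input (2 TB/day), without the Hadoop replication factor.

```python
def archived_volume_gb(raw_tb_per_day, days=365, gb_per_tb=1024):
    """Volume of raw (unreplicated) log data accumulated over `days`, in GB."""
    return raw_tb_per_day * days * gb_per_tb


def monthly_storage_cost(volume_gb, price_per_gb_month):
    """Storage-only cost per month for a given volume and unit price."""
    return volume_gb * price_per_gb_month


# Hypothetical placeholder, NOT a quoted AWS price; replace it with the rate
# from the provider's current pricing page or calculator.
PRICE_PER_GB_MONTH = 0.004

volume_gb = archived_volume_gb(raw_tb_per_day=2)
cost = monthly_storage_cost(volume_gb, PRICE_PER_GB_MONTH)
print(f"Archived volume after one year: {volume_gb / 1024:.0f} TB")
print(f"Storage-only cost per month at the placeholder rate: ${cost:,.2f}")
```

Keeping the unit price as a parameter makes it straightforward to rerun the same calculation with another provider's rate. Retrieval charges and the cost of the transfer cluster are separate inputs that this sketch does not model.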
The proposal to use cloud services for data archival in AWS is still under discussion, because the data is critical. If a cloud-based approach is accepted, the team would also want to evaluate other cloud providers (Azure, Cloud) and make a detailed comparison of data retrieval and cluster setup costs before making a final decision on the best approach, taking into account both the criticality of the data and the cost of each solution.