Data Science Machine Learning Data Analysis
This channel is for programmers, coders, and software engineers.

1- Data Science
2- Machine Learning
3- Data Visualization
4- Artificial Intelligence
5- Data Analysis
6- Statistics
7- Deep Learning

Cross promotion and ads: @hussein_sheikho
📚 Data Engineering Made Simple (2024)

1⃣ Join the channel to download:
https://yangx.top/+MhmkscCzIYQ2MmM8

2⃣ Download Book: https://yangx.top/c/1854405158/1865

💬 Tags: #DataEngineering

USEFUL CHANNELS FOR YOU ⭐️
📚 Financial Data Engineering (2024)

1⃣ Join the channel to download:
https://yangx.top/+MhmkscCzIYQ2MmM8

2⃣ Download Book: https://yangx.top/c/1854405158/2145

💬 Tags: #DataEngineering

USEFUL CHANNELS FOR YOU ⭐️
Polars.pdf
391.5 KB
📖 A comprehensive cheat sheet for working with Polars


🌟 Have you ever worked with pandas and thought it was the fastest way? I thought the same until I worked with Polars.

✏️ This cheat sheet explains everything about Polars in a concise and simple way. It's not just theory: it also includes plenty of real examples, practical tips, and projects that will genuinely help you in the real world.
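🧪 A tiny taste of the syntax (a minimal sketch, not taken from the cheat sheet, assuming a recent Polars version and a hypothetical "sales.csv" with "region" and "amount" columns):

import polars as pl

df = pl.read_csv("sales.csv")
result = (
    df.filter(pl.col("amount") > 100)       # keep only larger sales
      .group_by("region")                   # group by region
      .agg(pl.col("amount").sum().alias("total"))
)
print(result)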

🐻‍❄️ Polars Cheat Sheet
♾️ Google Colab
📖 Doc

#Polars #DataEngineering #PythonLibraries #PandasAlternative #PolarsCheatSheet #DataScienceTools #FastDataProcessing #GoogleColab #DataAnalysis #PythonForDataScience

✉️ Our Telegram channels: https://yangx.top/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
𝗬𝗼𝘂𝗿_𝗗𝗮𝘁𝗮_𝗦𝗰𝗶𝗲𝗻𝗰𝗲_𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄_𝗦𝘁𝘂𝗱𝘆_𝗣𝗹𝗮𝗻.pdf
7.7 MB
1. Master the fundamentals of Statistics

Understand probability, distributions, and hypothesis testing

Differentiate between descriptive and inferential statistics

Learn various sampling techniques

2. Get hands-on with Python & SQL

Work with data structures, pandas, numpy, and matplotlib

Practice writing optimized SQL queries

Master joins, filters, groupings, and window functions

3. Build real-world projects

Construct end-to-end data pipelines

Develop predictive models with machine learning

Create business-focused dashboards

4. Practice case study interviews

Learn to break down ambiguous business problems

Ask clarifying questions to gather requirements

Think aloud and structure your answers logically

5. Mock interviews with feedback

Use platforms like Pramp or connect with peers

Record and review your answers for improvement

Gather feedback on your explanation and presence

6. Revise machine learning concepts

Understand supervised vs unsupervised learning

Grasp overfitting, underfitting, and bias-variance tradeoff

Know how to evaluate models (precision, recall, F1-score, AUC, etc.); see the short sketch after this list

7. Brush up on system design (if applicable)

Learn how to design scalable data pipelines

Compare real-time vs batch processing

Familiarize yourself with tools such as Apache Spark, Kafka, and Airflow

8. Strengthen storytelling with data

Apply the STAR method in behavioral questions

Simplify complex technical topics

Emphasize business impact and insight-driven decisions

9. Customize your resume and portfolio

Tailor your resume for each job role

Include links to projects or GitHub profiles

Match your skills to job descriptions

10. Stay consistent and track progress

Set clear weekly goals

Monitor covered topics and completed tasks

Reflect regularly and adapt your plan as needed
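
A minimal sketch for point 6 (model evaluation), assuming scikit-learn and toy labels/predictions:

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [0, 1, 1, 0, 1]            # ground-truth labels
y_pred = [0, 1, 0, 0, 1]            # hard predictions from a model
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]  # predicted probabilities (needed for AUC)

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_prob))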


#DataScience #InterviewPrep #MLInterviews #DataEngineering #SQL #Python #Statistics #MachineLearning #DataStorytelling #SystemDesign #CareerGrowth #DataScienceRoadmap #PortfolioBuilding #MockInterviews #JobHuntingTips


✉️ Our Telegram channels: https://yangx.top/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
𝗦𝘆𝘀𝘁𝗲𝗺_𝗗𝗲𝘀𝗶𝗴𝗻_𝗥𝗼𝗮𝗱𝗺𝗮𝗽_𝗳𝗼𝗿_𝗠𝗔𝗔𝗡𝗚_&_𝗕𝗲𝘆𝗼𝗻𝗱.pdf
12.5 MB
𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻 𝗥𝗼𝗮𝗱𝗺𝗮𝗽 𝗳𝗼𝗿 𝗠𝗔𝗔𝗡𝗚 & 𝗕𝗲𝘆𝗼𝗻𝗱 🚀
If you're targeting top product companies or leveling up your backend/system design skills, this is for you.

System Design is no longer optional in tech interviews. It’s a must-have.
From Netflix, Amazon, Uber, YouTube, and Reddit to Twitter, these case studies and topic breakdowns will help you build real-world architectural thinking.

📌 Save this post. Spend 40 mins/day. Stay consistent.


➊ 𝗠𝘂𝘀𝘁-𝗞𝗻𝗼𝘄 𝗖𝗼𝗿𝗲 𝗖𝗼𝗻𝗰𝗲𝗽𝘁𝘀

👉 System Design Basics
🔗 https://bit.ly/3SuUR0Y

👉 Horizontal & Vertical Scaling
🔗 https://bit.ly/3slq5xh

👉 Load Balancing & Message Queues
🔗 https://bit.ly/3sp0FP4

👉 HLD vs LLD, Hashing, Monolith vs Microservices
🔗 https://bit.ly/3DnEfEm

👉 Caching, Indexing, Proxies
🔗 https://bit.ly/3SvyVDc

👉 Networking, CDN, How Browsers Work
🔗 https://bit.ly/3TOHQRb

👉 DB Sharding, CAP Theorem, Schema Design
🔗 https://bit.ly/3CZtfLN

👉 Concurrency, OOP, API Layering
🔗 https://bit.ly/3sqQrhj

👉 Estimation, Performance Optimization
🔗 https://bit.ly/3z9dSPN

👉 MapReduce, Design Patterns
🔗 https://bit.ly/3zcsfmv

👉 SQL vs NoSQL, Cloud Architecture
🔗 https://bit.ly/3z8Aa49


➋ 𝗠𝗼𝘀𝘁 𝗔𝘀𝗸𝗲𝗱 𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀

🔗 https://bit.ly/3Dp40Ux
🔗 https://bit.ly/3E9oH7K


➌ 𝗖𝗮𝘀𝗲 𝗦𝘁𝘂𝗱𝘆 𝗗𝗲𝗲𝗽 𝗗𝗶𝘃𝗲𝘀 (𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗲 𝗧𝗵𝗲𝘀𝗲!)

👉 Design Netflix
🔗 https://bit.ly/3GrAUG1

👉 Design Reddit
🔗 https://bit.ly/3OgGJrL

👉 Design Messenger
🔗 https://bit.ly/3DoAAXi

👉 Design Instagram
🔗 https://bit.ly/3BFeHlh

👉 Design Dropbox
🔗 https://bit.ly/3SnhncU

👉 Design YouTube
🔗 https://bit.ly/3dFyvvy

👉 Design Tinder
🔗 https://bit.ly/3Mcyj3X

👉 Design Yelp
🔗 https://bit.ly/3E7IgO5

👉 Design WhatsApp
🔗 https://bit.ly/3M2GOhP

👉 Design URL Shortener
🔗 https://bit.ly/3xP078x

👉 Design Amazon Prime Video
🔗 https://bit.ly/3hVpWP4

👉 Design Twitter
🔗 https://bit.ly/3qIG9Ih

👉 Design Uber
🔗 https://bit.ly/3fyvnlT

👉 Design TikTok
🔗 https://bit.ly/3UUlKxP

👉 Design Facebook Newsfeed
🔗 https://bit.ly/3RldaW7

👉 Design Web Crawler
🔗 https://bit.ly/3DPZTBB

👉 Design API Rate Limiter
🔗 https://bit.ly/3BIVuh7


➍ 𝗙𝗶𝗻𝗮𝗹 𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀

👉 All Solved Case Studies
🔗 https://bit.ly/3dCG1rc

👉 Design Terms & Terminology
🔗 https://bit.ly/3Om9d3H

👉 Complete Basics Series
🔗 https://bit.ly/3rG1cfr

#SystemDesign #TechInterviews #MAANGPrep #BackendEngineering #ScalableSystems #HLD #LLD #SoftwareArchitecture #DesignCaseStudies #CloudArchitecture #DataEngineering #DesignPatterns #LoadBalancing #Microservices #DistributedSystems


✉️ Our Telegram channels: https://yangx.top/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Topic: Python PySpark Data Sheet – Part 1 of 3: Introduction, Setup, and Core Concepts

---

### 1. What is PySpark?

PySpark is the Python API for Apache Spark, a powerful distributed computing engine for big data processing.

PySpark allows you to leverage the full power of Apache Spark using Python, making it easier to:

• Handle massive datasets
• Perform distributed computing
• Run parallel data transformations

---

### 2. PySpark Ecosystem Components

• Spark SQL – Structured data queries with DataFrame and SQL APIs
• Spark Core – Fundamental engine for task scheduling and memory management
• Spark Streaming – Real-time data processing
• MLlib – Machine learning at scale
• GraphX – Graph computation

---

### 3. Why PySpark over Pandas?

| Feature | Pandas | PySpark |
| -------------- | --------------------- | ----------------------- |
| Scale | Single machine | Distributed (Cluster) |
| Speed | Slower for large data | Optimized execution |
| Language | Python | Python on JVM via Py4J |
| Learning Curve | Easier | Medium (Big Data focus) |
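
A rough side-by-side sketch of the same operation in both libraries (assuming a hypothetical "data.csv" with a numeric "value" column):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

# pandas: everything runs on one machine, in memory
pdf = pd.read_csv("data.csv")
print(pdf.loc[pdf["value"] > 10, "value"].mean())

# PySpark: the same logic, executed lazily and distributed across a cluster
spark = SparkSession.builder.appName("PandasVsSpark").getOrCreate()
sdf = spark.read.csv("data.csv", header=True, inferSchema=True)
sdf.filter(col("value") > 10).agg(avg("value")).show()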

---

### 4. PySpark Setup on a Local Machine

#### Install PySpark via pip:

pip install pyspark


#### Start PySpark Shell:

pyspark


#### Sample Code to Initialize SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()


---

### 5. RDD vs DataFrame

| Feature | RDD | DataFrame |
| ------------ | ----------------------- | ------------------------------ |
| Type | Low-level API (objects) | High-level API (structured) |
| Optimization | Manual | Catalyst Optimizer (automatic) |
| Usage | Complex transformations | SQL-like operations |
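
A minimal sketch contrasting the two APIs on the same rows (assuming an existing `spark` session):

rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])
print(rdd.map(lambda r: (r[0], r[1] + 1)).collect())   # low-level: you write the plan yourself

df = spark.createDataFrame(rdd, ["Name", "Age"])       # high-level: Catalyst optimizes the plan
df.select("Name", (df["Age"] + 1).alias("AgePlusOne")).show()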

---

### 6. Creating DataFrames

#### From Python List:

data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()


#### From CSV File:

df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()


---

### 7. Inspecting DataFrames

df.printSchema()       # Schema info
df.columns             # List column names
df.describe().show()   # Summary stats
df.head(5)             # First 5 rows


---

### 8. Basic Transformations

df.select("Name").show()
df.filter(df["Age"] > 25).show()
df.withColumn("AgePlus10", df["Age"] + 10).show()
df.drop("Age").show()


---

### 9. Working with SQL

df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 25").show()


---

### 10. Writing Data

df.write.csv("output.csv", header=True)
df.write.parquet("output_parquet/")
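
A small optional extra (a sketch, assuming the same `df`): you can control overwrite behaviour and partition the output.

df.write.mode("overwrite").parquet("output_parquet/")   # replace any existing output
df.write.partitionBy("Age").parquet("output_by_age/")   # one folder per Age value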


---

### 11. Summary of Concepts Covered

• Spark architecture & PySpark setup
• Core components of PySpark
• Differences between RDD and DataFrames
• How to create, inspect, and manipulate DataFrames
• SQL support in Spark
• Reading/writing to/from storage

---

### Exercise

1. Load a sample CSV file and display the schema
2. Add a new column with a calculated value
3. Filter the rows based on a condition
4. Save the result as a new CSV or Parquet file
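
One possible solution sketch (assuming a hypothetical "sample.csv" with Name and Age columns; adapt names to your data):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("Exercise1").getOrCreate()

df = spark.read.csv("sample.csv", header=True, inferSchema=True)
df.printSchema()                                        # 1. display the schema

df = df.withColumn("AgePlus10", col("Age") + 10)        # 2. calculated column
df = df.filter(col("Age") > 25)                         # 3. filter rows
df.write.mode("overwrite").parquet("exercise_output/")  # 4. save the result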

---

#Python #PySpark #BigData #ApacheSpark #DataEngineering #ETL

https://yangx.top/DataScienceM
Topic: Python PySpark Data Sheet – Part 2 of 3: DataFrame Transformations, Joins, and Group Operations

---

### 1. Column Operations

PySpark supports various column-wise operations using expressions.

#### Select Specific Columns:

df.select("Name", "Age").show()


#### Create/Modify Column:

from pyspark.sql.functions import col

df.withColumn("AgePlus5", col("Age") + 5).show()


#### Rename a Column:

df.withColumnRenamed("Age", "UserAge").show()


#### Drop Column:

df.drop("Age").show()


---

### 2. Filtering and Conditional Logic

#### Filter Rows:

df.filter(col("Age") > 25).show()


#### Multiple Conditions:

df.filter((col("Age") > 25) & (col("Name") != "Alice")).show()


#### Using `when` for Conditional Columns:

from pyspark.sql.functions import when

df.withColumn("Category", when(col("Age") < 30, "Young").otherwise("Adult")).show()


---

### 3. Aggregations and Grouping

#### GroupBy + Aggregations:

df.groupBy("Department").count().show()
df.groupBy("Department").agg({"Salary": "avg"}).show()


#### Using Aggregate Functions:

from pyspark.sql.functions import avg, max, min, count

df.groupBy("Department").agg(
    avg("Salary").alias("AvgSalary"),
    max("Salary").alias("MaxSalary")
).show()


---

### 4. Sorting and Ordering

#### Sort by One or More Columns:

df.orderBy("Age").show()
df.orderBy(col("Salary").desc()).show()


---

### 5. Dropping Duplicates & Handling Missing Data

#### Drop Duplicates:

df.dropDuplicates(["Name", "Age"]).show()


#### Drop Rows with Nulls:

df.dropna().show()


#### Fill Null Values:

df.fillna({"Salary": 0}).show()


---

### 6. Joins in PySpark

PySpark supports the same join types you know from SQL.

#### Types of Joins:

• inner
• left
• right
• outer
• left_semi
• left_anti
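
The join examples that follow use two DataFrames, `df1` and `df2`; a minimal setup sketch (with hypothetical data) could look like this:

df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Cara")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "IT")], ["id", "dept"])

df1.join(df2, on="id", how="left_anti").show()   # rows in df1 with no match in df2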

#### Example – Inner Join:

df1.join(df2, on="id", how="inner").show()


#### Left Join Example:

df1.join(df2, on="id", how="left").show()


---

### 7. Working with Dates and Timestamps

from pyspark.sql.functions import current_date, current_timestamp

df.withColumn("today", current_date()).show()
df.withColumn("now", current_timestamp()).show()


#### Date Formatting:

from pyspark.sql.functions import date_format

df.withColumn("formatted", date_format(col("Date"), "yyyy-MM-dd")).show()


---

### 8. Window Functions (Advanced Aggregations)

Used for operations like ranking, cumulative sum, and moving average.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("rank", row_number().over(window_spec)).show()


---

### 9. Caching and Persistence

Use caching for performance when reusing data:

df.cache()
df.show()


Or use:

df.persist()
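
You can also choose an explicit storage level and release the cache when you are done (a small sketch):

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if memory runs out
df.count()                                 # an action materializes the cache
df.unpersist()                             # free the memory when no longer needed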


---

### 10. Summary of Concepts Covered

• Column transformations and renaming
• Filtering and conditional logic
• Grouping, aggregating, and sorting
• Handling nulls and duplicates
• All types of joins
• Working with dates and window functions
• Caching for performance

---

### Exercise

1. Load two CSV datasets and perform different types of joins
2. Add a new column with a custom label based on a condition
3. Aggregate salary data by department and show top-paid employees per department using window functions
4. Practice caching and observe performance
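
A possible solution sketch (assuming hypothetical "employees.csv" and "departments.csv" files sharing a "dept_id" column; adjust names to your data):

from pyspark.sql.functions import col, when, avg, row_number
from pyspark.sql.window import Window

emp = spark.read.csv("employees.csv", header=True, inferSchema=True)
dept = spark.read.csv("departments.csv", header=True, inferSchema=True)

joined = emp.join(dept, on="dept_id", how="inner")   # 1. also try "left", "outer", ...
joined = joined.withColumn(
    "Level", when(col("Salary") > 100000, "Senior").otherwise("Junior")
)                                                    # 2. custom label column

joined.groupBy("Department").agg(avg("Salary").alias("AvgSalary")).show()   # 3a. aggregate

w = Window.partitionBy("Department").orderBy(col("Salary").desc())
joined.withColumn("rank", row_number().over(w)).filter(col("rank") == 1).show()  # 3b. top-paid

joined.cache()    # 4. cache, then re-run an action and compare timings
joined.count()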

---

#Python #PySpark #DataEngineering #BigData #ETL #ApacheSpark

https://yangx.top/DataScienceM
🔥 Trending Repository: data-engineer-handbook

📝 Description: This is a repo with links to everything you'd ever want to learn about data engineering

🔗 Repository URL: https://github.com/DataExpert-io/data-engineer-handbook

📖 Readme: https://github.com/DataExpert-io/data-engineer-handbook#readme

📊 Statistics:
🌟 Stars: 36.3K
👀 Watchers: 429
🍴 Forks: 7K

💻 Programming Languages: Jupyter Notebook - Python - Makefile - Dockerfile - Shell

🏷️ Related Topics:
#data #awesome #sql #bigdata #dataengineering #apachespark


==================================
🧠 By: https://yangx.top/DataScienceM